Perceptual Optimization of Time-Frequency Audio Representations and Coding
Bilateral Call: France
Disciplines
Electrical Engineering, Electronics, Information Engineering (50%); Mathematics (20%); Psychology (30%)
Keywords
- Auditory Masking
- Efficiency
- Time-Frequency Representations
- Gabor
- Audio Coding
One of the greatest challenges in signal processing is to develop efficient signal representations. Such a representation extracts the relevant information in a signal and describes it with a minimal amount of data. In the specific context of sound processing, and especially in audio coding, where the goal is to minimize the amount of binary data required for storage or transmission, the representation should take human auditory perception into account and allow reconstruction with a controlled amount of perceived distortion. Over the last decades, many psychoacoustical studies have investigated auditory masking, an important property of auditory perception. Masking refers to the elevation of the detection threshold of a sound in the presence of another sound. The results were used to develop models of either spectral or temporal masking. Attempts were made to simply combine these models to account for time-frequency (t-f) masking in perceptual audio codecs. We recently conducted psychoacoustical studies on t-f masking that revealed the inaccuracy of such simple models. These new data form a crucial basis for accounting for masking effects in t-f representations of sounds. Although t-f representations are standard tools in audio processing, the development of a representation that is mathematically founded, perception-based, perfectly invertible, and ideally minimally redundant remains a challenge.

POTION addresses the following main questions: To what extent is it possible to obtain a perception-based (i.e., as close as possible to "what we get is what we hear"), perfectly invertible, and possibly minimally redundant t-f representation of sound signals? Such a representation is essential for modeling complex masking interactions in the t-f domain and is expected to improve our understanding of auditory sound processing. Moreover, it is of fundamental interest for many audio applications involving sound analysis-synthesis.
Is it possible to improve current perceptual audio codecs by taking a joint t-f approach? To reduce the size of digital audio files, perceptual audio codecs like MP3 apply a frequency transform and use spectral masking models to control the quantization of the transform coefficients. Current codecs thus follow a mainly spectral approach, although some implementations also account for temporal masking effects. By combining an efficient perception-based t-f transform with a joint t-f masking model in an audio codec, we expect to achieve significant performance improvements.

To investigate these issues, a multidisciplinary approach is required. Accordingly, POTION is based on a consortium involving the Laboratory for Mechanics and Acoustics (LMA, France) and the Acoustics Research Institute (ARI, Austria). The LMA features international experts in signal processing methods for the analysis-synthesis of non-stationary audio signals and in audio coding. The ARI features international experts in mathematics, t-f analysis, and psychoacoustics. By establishing strong interactions between the two institutions and disciplines, the members of POTION form an optimal consortium to achieve these goals.
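Perception-based t-f transforms of the kind envisaged here are typically built on an auditory frequency scale, most commonly the ERB (equivalent rectangular bandwidth) scale described by Glasberg and Moore. As a minimal illustrative sketch (the function names below are ours for illustration, not part of any project software), the following computes filter center frequencies spaced uniformly on the ERB scale:

```python
import numpy as np

def erb_bandwidth(f):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter
    centered at f (Hz), per the Glasberg & Moore approximation."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def hz_to_erb(f):
    """Map frequency in Hz to the ERB-number scale."""
    return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

def erb_to_hz(erb):
    """Inverse mapping: ERB number back to Hz."""
    return (10.0 ** (erb / 21.4) - 1.0) * 1000.0 / 4.37

def erb_center_frequencies(f_min, f_max, filters_per_erb=1.0):
    """Center frequencies spaced uniformly on the ERB scale,
    i.e., the channel layout of an auditory filter bank."""
    e_min, e_max = hz_to_erb(f_min), hz_to_erb(f_max)
    n = int(np.floor((e_max - e_min) * filters_per_erb)) + 1
    return erb_to_hz(e_min + np.arange(n) / filters_per_erb)

fc = erb_center_frequencies(50.0, 8000.0)  # about 32 channels at 1 filter/ERB
```

Increasing `filters_per_erb` yields a denser, more redundant channel layout; exposing this trade-off between redundancy and resolution is precisely what a perception-based filter bank design must do.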
The fundamental research in POTION aimed to develop new methods for the representation and interpretation of audio signals. An efficient representation extracts the relevant information and describes it with a minimal amount of data. In the specific context of sound processing, and especially in audio coding, where the goal is to minimize the amount of binary data required for storage or transmission, the representation should take human auditory perception into account and allow reconstruction with a controlled amount of perceived distortion. The main goal was to obtain a perceptually optimized representation, i.e., one that displays only the significantly audible components of sound signals. To achieve this goal, the research in POTION focused on both time-frequency (TF) analysis methods and psychoacoustics.

TF representations are standard tools in audio processing. They display the temporal evolution (x-coordinate) of each spectral component (y-coordinate) of a signal as an image. The temporal and spectral resolution of the image depend on the mathematical properties and implementation of the representation. At the start of the project, no TF representation was available that mimics the auditory TF resolution and allows perfect reconstruction. In POTION, such a representation was developed: the Audlet framework provides a versatile and efficient filter bank design for the analysis and synthesis of audio signals using auditory frequency scales. It is highly suitable for audio applications requiring stability, perfect reconstruction, and a flexible choice of redundancy.

To obtain a perceptually optimized TF representation, it was necessary to investigate auditory masking. Masking refers to the elevation of the detection threshold of a sound in the presence of another sound. Over the last decades, many psychoacoustical studies have investigated masking. Their results were used to develop models of either spectral or temporal masking.
Attempts were made to simply combine these models to account for TF masking. However, preliminary TF masking data collected before the project began revealed the inaccuracy of such simple models. To propose an accurate model of TF masking, additional masking data were collected in POTION. These data were implemented in a filter that retains only the audible components of the Audlet representation. Moreover, psychoacoustic experiments conducted in POTION contributed new methods and data on the measurement of cochlear compression in humans.

Another research question in POTION was: Is it possible to improve the performance of lossy coding algorithms? To reduce the size of digital audio files, perceptual audio codecs like MP3 decompose sounds into variable-length time segments, apply a frequency transform, and use masking models to control the quantization of the transform coefficients within each segment. Current codecs thus follow a mainly spectral approach. By combining an efficient perception-based TF transform with a joint TF masking model in an audio codec, significant performance improvements were expected. A variant of the Audlet adapted to audio coding was developed: the ERB-MDCT. A lossy coder/decoder was then implemented; it combines the ERB-MDCT with a sparse decomposition algorithm that uses the TF masking model developed in the project. This coder was optimized for low bitrates (24-48 kbps) and challenges state-of-the-art codecs (HE-AAC v2).
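The principle of discarding perceptually irrelevant coefficients can be illustrated with a deliberately simplified sketch. The code below is not the project's TF masking model or the ERB-MDCT coder; it uses a plain STFT and a crude per-frame level threshold purely to show how a masking-style rule induces sparsity in a TF representation:

```python
import numpy as np

def stft(x, win_len=256, hop=128):
    """Toy analysis stage: short-time Fourier transform with a Hann window."""
    w = np.hanning(win_len)
    frames = [np.fft.rfft(w * x[i:i + win_len])
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.array(frames)  # shape: (n_frames, n_bins)

def prune_inaudible(X, threshold_db=-40.0):
    """Zero out TF coefficients more than `threshold_db` below the
    strongest coefficient in the same frame (toy masking-style rule)."""
    mag = np.abs(X)
    ref = mag.max(axis=1, keepdims=True)
    keep = mag > ref * 10.0 ** (threshold_db / 20.0)
    return np.where(keep, X, 0.0)

# Example: a sinusoid plus weak noise; almost all noise coefficients
# fall below the threshold and are pruned, leaving a sparse representation.
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 / 8000 * np.arange(4096)) \
    + 1e-4 * rng.standard_normal(4096)
X = stft(x)
Xp = prune_inaudible(X)
sparsity = 1.0 - np.count_nonzero(Xp) / Xp.size
```

A real perceptual coder replaces both ingredients: the STFT becomes an invertible auditory-scale transform, and the fixed level threshold becomes a signal-dependent masking model, so that the retained coefficients correspond to what is actually audible.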
- Martin Ehler, Universität Wien, associated research partner
- Olivier Derrien, Centre National de la Recherche Scientifique (CNRS) - France
Research Output
- 123 Citations
- 7 Publications
- 2015: Derrien O, "A Quasi-Orthogonal, Invertible, and Perceptually Relevant Time-Frequency Transform for Audio Coding," Proc. EUSIPCO 2015, pp. 799-803. DOI: 10.1109/eusipco.2015.7362493
- 2018: Necciari T, "Audlet Filter Banks: A Versatile Analysis/Synthesis Framework Using Auditory Frequency Scales," Applied Sciences, article 96. DOI: 10.3390/app8010096
- 2017: Balazs P, "Frame Theory for Signal Processing in Psychoacoustics," book chapter, Springer Nature, pp. 225-268. DOI: 10.1007/978-3-319-54711-4_10
- 2016: Necciari T, "Auditory Time-Frequency Masking for Spectrally and Temporally Maximally-Compact Stimuli," PLOS ONE. DOI: 10.1371/journal.pone.0166937
- 2016: Tabuchi H, "The role of compression in the simultaneous masker phase effect," The Journal of the Acoustical Society of America, pp. 2680-2694. DOI: 10.1121/1.4964328
- 2013: Necciari T, "The ERBlet Transform: An Auditory-Based Time-Frequency Representation with Perfect Reconstruction," Proc. ICASSP 2013, pp. 498-502. DOI: 10.1109/icassp.2013.6637697
- 2014: Chardon G, "Perceptual Matching Pursuit with Gabor Dictionaries and Time-Frequency Masking," Proc. ICASSP 2014, pp. 3102-3106. DOI: 10.1109/icassp.2014.6854171