Perceptual Optimization of Time-Frequency Audio Representations and Coding
Bilateral Call: France
Disciplines
Electrical Engineering, Electronics, Information Engineering (50%); Mathematics (20%); Psychology (30%)
Keywords
- Auditory Masking
- Efficiency
- Time-Frequency Representations
- Gabor
- Audio Coding
One of the greatest challenges in signal processing is to develop efficient signal representations. Such a representation extracts the relevant information in a signal and describes it with a minimal amount of data. In the specific context of sound processing, and especially in audio coding, where the goal is to minimize the amount of binary data required for storage or transmission, the representation should take human auditory perception into account and allow reconstruction with a controlled amount of perceived distortion. Over the last decades, many psychoacoustical studies have investigated auditory masking, an important property of auditory perception. Masking refers to the elevation of the detection threshold of a sound in the presence of another sound. The results were used to develop models of either spectral or temporal masking. Attempts were made to simply combine these models to account for time-frequency (t-f) masking in perceptual audio codecs. We recently conducted psychoacoustical studies on t-f masking that revealed the inaccuracy of such simple models. These new data form a crucial basis for accounting for masking effects in t-f representations of sounds. Although t-f representations are standard tools in audio processing, the development of a representation that is mathematically founded, perception-based, perfectly invertible, and ideally minimally redundant remains a challenge.

POTION addresses the following main questions: To what extent is it possible to obtain a perception-based (i.e., as close as possible to "what we get is what we hear"), perfectly invertible, and possibly minimally redundant t-f representation of sound signals? Such a representation is essential for modeling complex masking interactions in the t-f domain and is expected to improve our understanding of auditory sound processing. Moreover, it is of fundamental interest for many audio applications involving sound analysis-synthesis.
Is it possible to improve current perceptual audio codecs by taking a joint t-f approach? To reduce the size of digital audio files, perceptual audio codecs like MP3 apply a frequency transform and use spectral masking models to control the quantization of the transform coefficients. Current codecs thus follow a mainly spectral approach, although some implementations also account for temporal masking effects. By combining an efficient perception-based t-f transform with a joint t-f masking model in an audio codec, we expect to achieve significant performance improvements.

To investigate these issues, a multidisciplinary approach is required. Accordingly, POTION is based on a consortium involving the Laboratory for Mechanics and Acoustics (LMA, France) and the Acoustics Research Institute (ARI, Austria). The LMA features international experts in signal processing methods for the analysis-synthesis of non-stationary audio signals and in audio coding. The ARI features international experts in mathematics, t-f analysis, and psychoacoustics. By establishing strong interactions between the two institutions and disciplines, the members of POTION form an optimal consortium to achieve these goals.
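Perception-based t-f transforms of the kind envisaged here are typically built on an auditory frequency scale, most commonly the ERB (equivalent rectangular bandwidth) scale described by Glasberg and Moore. As a minimal illustrative sketch (the function names below are ours for illustration, not part of any project software), the following computes filter center frequencies spaced uniformly on the ERB scale:

```python
import numpy as np

def erb_bandwidth(f):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter
    centered at f (Hz), per the Glasberg & Moore approximation."""
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def hz_to_erb(f):
    """Map frequency in Hz to the ERB-number scale."""
    return 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)

def erb_to_hz(erb):
    """Inverse mapping: ERB number back to Hz."""
    return (10.0 ** (erb / 21.4) - 1.0) * 1000.0 / 4.37

def erb_center_frequencies(f_min, f_max, filters_per_erb=1.0):
    """Center frequencies spaced uniformly on the ERB scale,
    i.e., the channel layout of an auditory filter bank."""
    e_min, e_max = hz_to_erb(f_min), hz_to_erb(f_max)
    n = int(np.floor((e_max - e_min) * filters_per_erb)) + 1
    return erb_to_hz(e_min + np.arange(n) / filters_per_erb)

fc = erb_center_frequencies(50.0, 8000.0)  # about 32 channels at 1 filter/ERB
```

Increasing `filters_per_erb` yields a denser, more redundant channel layout; exposing this trade-off between redundancy and resolution is precisely what a perception-based filter bank design must do.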
The fundamental research in POTION aimed to develop new methods for the representation and interpretation of audio signals. An efficient representation extracts the relevant information and describes it with a minimal amount of data. In the specific context of sound processing, and especially in audio coding, where the goal is to minimize the amount of binary data required for storage or transmission, the representation should take human auditory perception into account and allow reconstruction with a controlled amount of perceived distortion. The main goal was to obtain a perceptually optimized representation, i.e., one that displays only the significantly audible components of sound signals. To achieve this goal, the research in POTION focused on both time-frequency (TF) analysis methods and psychoacoustics.

TF representations are standard tools in audio processing. They display the temporal evolution (x-coordinate) of each spectral component (y-coordinate) of a signal as an image. The temporal and spectral resolution of the image depend on the mathematical properties and implementation of the representation. At the start of the project, no TF representation was available that mimics the auditory TF resolution and allows perfect reconstruction. In POTION, such a representation was developed: the Audlet framework provides a versatile and efficient filter bank design for the analysis and synthesis of audio signals using auditory frequency scales. It is highly suitable for audio applications requiring stability, perfect reconstruction, and a flexible choice of redundancy.

To obtain a perceptually optimized TF representation, it was necessary to investigate auditory masking. Masking refers to the elevation of the detection threshold of a sound in the presence of another sound. Over the last decades, many psychoacoustical studies have investigated masking. Their results were used to develop models of either spectral or temporal masking.
Attempts were made to simply combine these models to account for TF masking. However, preliminary TF masking data collected before the project began revealed the inaccuracy of such simple models. To propose an accurate model of TF masking, additional masking data were collected in POTION. These data were implemented in a filter that retains only the audible components of the Audlet representation. Moreover, psychoacoustic experiments conducted in POTION contributed new methods and data on the measurement of cochlear compression in humans.

Another research question in POTION was: Is it possible to improve the performance of lossy coding algorithms? To reduce the size of digital audio files, perceptual audio codecs like MP3 decompose sounds into variable-length time segments, apply a frequency transform, and use masking models to control the quantization of the transform coefficients within each segment. Current codecs thus follow a mainly spectral approach. By combining an efficient perception-based TF transform with a joint TF masking model in an audio codec, significant performance improvements were expected. A variant of the Audlet adapted to audio coding was developed: the ERB-MDCT. A lossy coder/decoder was then implemented; it combines the ERB-MDCT with a sparse decomposition algorithm that uses the TF masking model developed in the project. This coder was optimized for low bitrates (24-48 kbps) and challenges state-of-the-art codecs (HE-AAC v2).
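The principle of discarding perceptually irrelevant coefficients can be illustrated with a deliberately simplified sketch. The code below is not the project's TF masking model or the ERB-MDCT coder; it uses a plain STFT and a crude per-frame level threshold purely to show how a masking-style rule induces sparsity in a TF representation:

```python
import numpy as np

def stft(x, win_len=256, hop=128):
    """Toy analysis stage: short-time Fourier transform with a Hann window."""
    w = np.hanning(win_len)
    frames = [np.fft.rfft(w * x[i:i + win_len])
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.array(frames)  # shape: (n_frames, n_bins)

def prune_inaudible(X, threshold_db=-40.0):
    """Zero out TF coefficients more than `threshold_db` below the
    strongest coefficient in the same frame (toy masking-style rule)."""
    mag = np.abs(X)
    ref = mag.max(axis=1, keepdims=True)
    keep = mag > ref * 10.0 ** (threshold_db / 20.0)
    return np.where(keep, X, 0.0)

# Example: a sinusoid plus weak noise; almost all noise coefficients
# fall below the threshold and are pruned, leaving a sparse representation.
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 440 / 8000 * np.arange(4096)) \
    + 1e-4 * rng.standard_normal(4096)
X = stft(x)
Xp = prune_inaudible(X)
sparsity = 1.0 - np.count_nonzero(Xp) / Xp.size
```

A real perceptual coder replaces both ingredients: the STFT becomes an invertible auditory-scale transform, and the fixed level threshold becomes a signal-dependent masking model, so that the retained coefficients correspond to what is actually audible.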
- Martin Ehler, Universität Wien, associated research partner
- Olivier Derrien, Centre National de la Recherche Scientifique (CNRS) - France
Research Output
- 123 Citations
- 7 Publications
- 2015: Derrien O, "A Quasi-Orthogonal, Invertible, and Perceptually Relevant Time-Frequency Transform for Audio Coding," Proc. EUSIPCO 2015, pp. 799-803. DOI: 10.1109/eusipco.2015.7362493
- 2018: Necciari T, "Audlet Filter Banks: A Versatile Analysis/Synthesis Framework Using Auditory Frequency Scales," Applied Sciences, article 96. DOI: 10.3390/app8010096
- 2017: Balazs P, "Frame Theory for Signal Processing in Psychoacoustics," book chapter, Springer Nature, pp. 225-268. DOI: 10.1007/978-3-319-54711-4_10
- 2016: Necciari T, "Auditory Time-Frequency Masking for Spectrally and Temporally Maximally-Compact Stimuli," PLOS ONE. DOI: 10.1371/journal.pone.0166937
- 2016: Tabuchi H, "The role of compression in the simultaneous masker phase effect," The Journal of the Acoustical Society of America, pp. 2680-2694. DOI: 10.1121/1.4964328
- 2013: Necciari T, "The ERBlet Transform: An Auditory-Based Time-Frequency Representation with Perfect Reconstruction," Proc. ICASSP 2013, pp. 498-502. DOI: 10.1109/icassp.2013.6637697
- 2014: Chardon G, "Perceptual Matching Pursuit with Gabor Dictionaries and Time-Frequency Masking," Proc. ICASSP 2014, pp. 3102-3106. DOI: 10.1109/icassp.2014.6854171