Automatic Segmentation, Labelling, and Characterisation of Audio Streams
Disciplines
Electrical Engineering, Electronics, Information Engineering (10%); Computer Sciences (85%); Arts (5%)
Keywords
Music Information Retrieval (MIR), Machine Learning, Audio and Music Classification
The goal of this project is to develop technologies for the automatic segmentation and interpretation of audio files and audio streams deriving from different media worlds: music repositories, (Web and terrestrial) radio streams, TV broadcasts, etc. A specific focus is on streams in which music plays an important role. Specifically, the technologies to be developed should address the following tasks: (1) automatic segmentation (with or without meta-information) of audio streams into coherent or otherwise meaningful units or segments (based on general sound or rhythm similarity or homogeneity, on specific types of content and characteristics, on repeated occurrences of subsections, etc.); (2) the automatic categorisation of such audio segments into classes, and the association of segments and classes with meta-data derived from various sources (including the Web); (3) the automatic characterisation of audio segments and sound objects in terms of concepts intuitively understandable to humans. To this end, we plan to develop and/or improve and optimise computational methods that analyse audio streams, identify specific kinds of audio content (e.g., music, singing, speech, applause, commercials, ...), detect boundaries and transitions between songs, and classify musical and other segments into appropriate categories; that combine information from various sources (the audio signal itself, databases, the Internet) in order to refine the segmentation and gain meta-information; that automatically discover and optimise audio features that improve segmentation and classification; and that learn to derive comprehensible descriptions of audio contents from such audio features (via machine learning). The research is motivated by a large class of challenging applications in the media world that require efficient and robust audio segmentation and classification. 
Application scenarios include audio streaming services and Web stream analysis, automatic media monitoring, content- and descriptor-based search in large multimedia (audio) databases, and artistic applications. That there is a strong and very concrete demand for such methods is documented, among other things, by the fact that several companies from the media world have pledged to support this project with large amounts of real-world data and valuable meta-information.
The goal of this project was to develop technologies for the automatic segmentation and interpretation of audio files and audio streams deriving from different media worlds: music repositories, radio streams, TV broadcasts, etc. A specific focus was to be placed on streams in which music plays an important role. For these domains, we have conducted fundamental research and developed commercial applications side by side. The key technology used in this project was Convolutional Neural Networks (CNNs), a relatively new and powerful tool in the domain of machine learning, which we were among the first to apply to music recordings. Specifically, we addressed the tasks of onset detection (detecting the starting points of musical notes), music segmentation (detecting the boundaries between parts of a music piece), singing voice detection (detecting where in a music piece there are vocalizations), and beat annotation (detecting the metrical structure of a music piece). Our work both served as a pioneering example for other researchers and demonstrated the versatility of training CNNs on spectral input, questioning the need for hand-designed features. While we have improved the state of the art in all tasks we considered, we obtained the most marked improvements for music segmentation, a key concern for this project. We could also show that for music segmentation and music similarity estimation, current state-of-the-art results have nearly reached an upper bound stemming from the ambiguity of the tasks or the subjectivity of human judgements. In the quest to learn to categorize audio segments in the face of scarce ground-truth data, we have investigated data augmentation schemes for music recordings and learning from imprecise annotations ('weak labels').
We have pursued additional directions of highly application-driven research suitable for a translational project: we used deep learning to accelerate an existing music similarity measure so that it becomes applicable to commercial-scale collections; we improved music similarity estimation using a technique borrowed from speech processing; we developed a novel audio identification method robust to pitch and tempo changes; and we developed methods for real-time singing voice detection as well as for real-time music, speech, and applause detection. Several of our methods are already being employed or tested by commercial parties. We also participated in an international challenge on detecting bird calls in audio recordings. Our approach achieved the best results, showing that the methodology we used for music analysis also applies to more generic audio analysis.
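As a rough illustration of the onset detection task described above, the sketch below turns a per-frame onset activation curve into onset times by picking local maxima. It is not taken from the project's publications: the function name `pick_peaks`, the threshold of 0.5, the frame rate of 100 frames per second, and the example values are all assumptions made for illustration. In practice such a curve would be the output of a trained network, and the post-processing may differ.

```python
def pick_peaks(activations, threshold=0.5, frame_rate=100.0):
    """Return times (in seconds) of local maxima at or above `threshold`.

    `activations` is a sequence of per-frame onset scores in [0, 1].
    `threshold` and `frame_rate` are illustrative defaults, not values
    reported by the project.
    """
    onsets = []
    # Skip the first and last frame, which lack a neighbour on one side.
    for i in range(1, len(activations) - 1):
        value = activations[i]
        is_peak = value > activations[i - 1] and value >= activations[i + 1]
        if is_peak and value >= threshold:
            onsets.append(i / frame_rate)
    return onsets


# Made-up activation curve for demonstration; not data from the project.
example = [0.0, 0.1, 0.9, 0.2, 0.1, 0.6, 0.7, 0.3, 0.4, 0.1]
onset_times = pick_peaks(example)
print(onset_times)
```

Frames whose score is a local maximum but stays below the threshold (such as the value 0.4 in the example) are ignored, which is one simple way to trade missed onsets against false detections.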
Research Output
- 394 Citations
- 18 Publications
- 2016; Title: The Problem of Limited Inter-rater Agreement in Modelling Music Similarity; DOI: 10.1080/09298215.2016.1200631; Type: Journal Article; Author: Flexer A; Journal: Journal of New Music Research; Pages: 239-251
- 2015; Title: Exploring Data Augmentation for Improved Singing Voice Detection with Neural Networks; Type: Conference Proceeding Abstract; Author: Grill T; Conference: Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), Malaga, Spain
- 2015; Title: Music Boundary Detection Using Neural Networks on Combined Features and Two-Level Annotations; Type: Conference Proceeding Abstract; Author: Grill T; Conference: Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), Malaga, Spain
- 2015; Title: A Low-Latency, Real-Time-Capable Singing Voice Detection Method with LSTM Recurrent Neural Networks; DOI: 10.1109/eusipco.2015.7362337; Type: Conference Proceeding Abstract; Author: Lehner B; Pages: 21-25
- 2015; Title: Music Boundary Detection Using Neural Networks on Spectrograms and Self-Similarity Lag Matrices; DOI: 10.1109/eusipco.2015.7362593; Type: Conference Proceeding Abstract; Author: Grill T; Pages: 1296-1300
- 2017; Title: Two Convolutional Neural Networks for Bird Detection in Audio Signals; DOI: 10.23919/eusipco.2017.8081512; Type: Conference Proceeding Abstract; Author: Grill T; Pages: 1764-1768
- 2016; Title: Learning to Pinpoint Singing Voice from Weakly Labeled Examples; DOI: 10.5281/zenodo.1417650; Type: Other; Author: Schlüter J
- 2016; Title: Learning to Pinpoint Singing Voice from Weakly Labeled Examples; DOI: 10.5281/zenodo.1417651; Type: Other; Author: Schlüter J
- 2016; Title: Learning to Pinpoint Singing Voice from Weakly Labeled Examples; Type: Conference Proceeding Abstract; Author: Schlüter J; Conference: Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR), New York, USA
- 2014; Title: On the Reduction of False Positives in Singing Voice Detection; DOI: 10.1109/icassp.2014.6855054; Type: Conference Proceeding Abstract; Author: Lehner B; Pages: 7480-7484
- 2014; Title: Improved Musical Onset Detection with Convolutional Neural Networks; DOI: 10.1109/icassp.2014.6854953; Type: Conference Proceeding Abstract; Author: Schlüter J; Pages: 6979-6983
- 2014; Title: On World Construction, Variation: Duoddaris; Type: Conference Proceeding Abstract; Author: Grill T; Conference: Proceedings of the Second conference on Computation, Communication, Aesthetics and X (xCoax), Porto, Portugal
- 2015; Title: A Low-Latency, Real-Time-Capable Singing Voice Detection Method with LSTM Recurrent Neural Networks; DOI: 10.5281/zenodo.38849; Type: Other; Author: Böck S
- 2015; Title: Robust Quad-Based Audio Fingerprinting; DOI: 10.1109/taslp.2015.2509248; Type: Journal Article; Author: Sonnleitner R; Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing; Pages: 409-421
- 2013; Title: Musical Onset Detection with Convolutional Neural Networks; Type: Conference Proceeding Abstract; Author: Böck S; Conference: 6th International Workshop on Machine Learning and Music (MML) in conjunction with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD), Prague, Czech Republic
- 2013; Title: Learning Binary Codes for Efficient Large-Scale Music Similarity Search; Type: Conference Proceeding Abstract; Author: Schlüter J; Conference: Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR), Curitiba, Brazil
- 2015; Title: Improving Voice Activity Detection in Movies; Type: Conference Proceeding Abstract; Author: Lehner B; Conference: Proceedings of the 16th Annual Conference of the International Speech Communication Association (INTERSPEECH 2015), Dresden, Germany
- 2014; Title: Boundary Detection in Music Structure Analysis using Convolutional Neural Networks; Type: Conference Proceeding Abstract; Author: Grill T et al.; Conference: Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), Taipei, Taiwan