Adaptive Audio-Visual Dialect Speech Synthesis
Disciplines
Computer Sciences (85%); Linguistics and Literature (15%)
Keywords
Speech Synthesis, Visual Synthesis, Dialect
The goal of this project is to investigate multimodal adaptation for audio-visual speech synthesis. Human speech is multimodal, and we therefore aim to model the audio and visual signals jointly. Furthermore, speech behavior exhibits both intra-speaker variability (e.g., variability depending on the speech situation, the speaking task, or the emotional state of the speaker) and inter-speaker variability (e.g., variability across sociolects and/or dialects). The second type of variation can be modeled by adapting average models built from speakers of different dialects to a speaker of one specific dialect. Dialect is chosen as the source of inter-speaker variation in order to extend our previous work on Viennese sociolects to other Austrian dialects and to conduct basic research on the audio-visual synthesis of dialects.
Generally, audio-visual speech synthesis is the attempt to generate both the speech and the visual signals of a person speaking. In most previous approaches the acoustic and visual signals were modeled separately, although both are the result of the same underlying articulation process and should be treated as one. Moreover, adding visual information to the synthesis models might lead to better overall acoustic synthesis. We therefore propose a joint audio-visual modeling framework that is able to generate both acoustic and visual speech for different Austrian dialects. By employing hidden Markov models (HMMs) for both audio and visual speech synthesis, these two feature streams can be combined into a single model. A major aspect of this project will therefore be the multimodal adaptation of audio-visual synthesis models. The joint adaptation of audio and visual models from multimodal audio-visual models has not yet been investigated and raises several important research questions that we want to address in this project.
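The stream combination described above can be illustrated with a small sketch. The following is a minimal, illustrative example only, not the project's actual synthesis system: it assumes the hmmlearn library, random placeholder data, and invented feature dimensions, and it trains one Gaussian HMM on frame-wise concatenated acoustic and visual features so that a single state sequence explains both modalities.

```python
# Minimal sketch of joint audio-visual HMM modeling via feature stacking.
# All dimensions and data are illustrative placeholders.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

# Placeholder features for one utterance, already time-aligned:
# e.g. 40-dim acoustic frames and 9-dim visual frames (facial markers).
n_frames = 300
acoustic = rng.normal(size=(n_frames, 40))
visual = rng.normal(size=(n_frames, 9))

# Joint modeling: concatenate the two streams into one observation vector,
# so a single state sequence explains both modalities.
joint = np.hstack([acoustic, visual])

# One HMM over the joint stream; diagonal covariances keep the per-state
# Gaussians tractable, as is common in HMM-based synthesis.
model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
model.fit(joint)

# Each state's mean now pairs an acoustic configuration with the visual
# configuration produced by the same underlying articulation.
acoustic_means = model.means_[:, :40]
visual_means = model.means_[:, 40:]
```

The point of the joint observation vector is that each state ties an acoustic configuration to the facial configuration produced by the same articulation, a coupling that separately trained per-modality models cannot express.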
Generally, audio-visual speech synthesis is the attempt to generate both the speech and the visual signals of a person speaking. Audio-visual synthesis can be used in communication technologies and computer games. In this project we investigated multimodal modeling for audio-visual dialect speech synthesis. Human speech is multimodal, and we therefore modeled the audio and visual signals jointly. In most previous approaches the acoustic and visual signals were modeled separately, although both are the result of the same underlying articulation process and should be treated as one.
In this project we were able to show that joint modeling of the visual and acoustic signals can lead to better visual synthesis without changing the quality of the acoustic synthesis. Because these models are flexible and can be adapted via their parameters, they can easily be reused and transformed. Furthermore, we showed that adapting visual average models with new data improves modeling compared to models that do not use any background data. With this method it is possible to train a visual model of a person from a small amount of adaptation data. For controlling acoustic models with a large number of parameters, we developed a method that allows acoustic parameters to be changed via visual parameters: opening the mouth in the visual model thus leads to the corresponding changes in the acoustic model (see the sketches below).
For the modeling of dialects we made extensive recordings of two Austrian dialects with 8 speakers: a Middle Bavarian dialect from Upper Austria (Bad Goisern) and a South Bavarian dialect from Tyrol (Innervillgraten). For these audio-visual dialect recordings for speech synthesis, we developed a method for phonetic data collection and audio-visual recording, and we developed methods for the optimal use of dialect data in modeling. The recorded data is already being used in other running projects and will also lead to new findings in the future.
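Two of the results summarized above lend themselves to small illustrative sketches. First, adaptation of an average (background) model with a small amount of new data: the sketch below is a minimal stand-in, assuming Gaussian state means and a MAP-style update with an invented relevance weight tau; it is not the project's actual adaptation recipe.

```python
# Minimal sketch of MAP-style mean adaptation of an average model.
# Names, shapes, and the relevance weight tau are illustrative assumptions.
import numpy as np

def map_adapt_means(prior_means, frames, posteriors, tau=10.0):
    """Interpolate average-model means toward target-speaker statistics.

    prior_means: (n_states, dim) means of the average (background) model.
    frames:      (n_frames, dim) adaptation data from the new speaker.
    posteriors:  (n_frames, n_states) state occupancies per frame
                 (e.g. from a forward-backward pass).
    tau:         relevance weight; larger values trust the prior more.
    """
    occupancy = posteriors.sum(axis=0)            # (n_states,)
    weighted_sum = posteriors.T @ frames          # (n_states, dim)
    # States with little data stay close to the average model; states with
    # plenty of data move toward the new speaker's empirical mean.
    return (tau * prior_means + weighted_sum) / (tau + occupancy)[:, None]

# Toy usage: 5 states, 12-dim visual features, 200 adaptation frames.
rng = np.random.default_rng(1)
prior = rng.normal(size=(5, 12))
data = rng.normal(size=(200, 12))
gamma = rng.dirichlet(np.ones(5), size=200)       # fake posteriors
adapted = map_adapt_means(prior, data, gamma)
```

Because sparsely observed states remain anchored to the background model, a usable speaker-specific model can be obtained from a small amount of adaptation data. Second, visual control of acoustic parameters: the sketch below assumes a simple least-squares linear mapping from the visual to the acoustic parameter space. The project's actual method operates on hidden-semi-Markov-model parameters (see the AVSP 2013 entry in the research output below); all names and dimensions here are illustrative.

```python
# Minimal sketch of controlling acoustic parameters via visual parameters
# through a learned linear mapping. Data and dimensions are placeholders.
import numpy as np

rng = np.random.default_rng(2)

# Paired, time-aligned parameter vectors from training data.
visual = rng.normal(size=(500, 9))      # e.g. mouth-shape parameters
acoustic = rng.normal(size=(500, 40))   # e.g. spectral parameters

# Least-squares estimate of a visual-to-acoustic regression matrix A,
# so that acoustic ≈ visual @ A.
A, *_ = np.linalg.lstsq(visual, acoustic, rcond=None)

# An edit in the visual space (e.g. opening the mouth wider) can then be
# propagated to the acoustic side as delta_acoustic = delta_visual @ A.
delta_visual = np.zeros(9)
delta_visual[0] = 1.0                   # hypothetical "mouth opening" axis
delta_acoustic = delta_visual @ A
```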
- Sylvia Moosmüller, Österreichische Akademie der Wissenschaften, associated research partner
Research Output
- 233 Citations
- 14 Publications
- Pucher M (2011): Phone set selection for HMM-based dialect speech synthesis. Conference proceeding.
- Hoole P et al. (2014): The MMASCS multi-modal annotated synchronous corpus of audio, video, facial motion and tongue motion data of normal, fast and slow speech. In Proc. LREC 2014.
- Schabus D (2013): Joint audiovisual hidden semi-Markov model-based speech synthesis. IEEE Journal of Selected Topics in Signal Processing, pp. 336-347. DOI: 10.1109/jstsp.2013.2281036.
- Hofer G et al. (2013): Objective and subjective feature evaluation for speaker-adaptive visual speech synthesis. In Proc. AVSP 2013.
- Hollenstein J (2013): Visual control of hidden-semi-Markov-model based acoustic speech synthesis. In Proc. AVSP 2013.
- Hofer G et al. (2012): From Viennese to Austrian German and back again: an algorithm for the realization of a variety slider. In Proc. SIDG 2012.
- Hofer G et al. (2012): Building a synchronous corpus of acoustic and 3D facial marker data for adaptive audiovisual speech synthesis. In Proc. LREC 2012.
- Hofer G et al. (2012): Sprachressourcen für adaptive Sprachsynthesen von Dialekten [Speech resources for adaptive speech synthesis of dialects]. In Proc. SIDG 2012.
- De Leon P (2012): Evaluation of speaker verification security and detection of HMM-based synthetic speech. IEEE Transactions on Audio, Speech, and Language Processing, pp. 2280-2290. DOI: 10.1109/tasl.2012.2201472.
- De Leon P (2011): Detection of synthetic speech for the problem of imposture. In Proc. ICASSP 2011, pp. 4844-4847. DOI: 10.1109/icassp.2011.5947440.
- Hofer G et al. (2012): Speaker-adaptive visual speech synthesis in the HMM framework. Conference proceeding.
- Moosmüller S et al. (2012): Regionalizing virtual avatars: towards adaptive audio-visual dialect speech synthesis. In Proc. 5th International Conference on Cognitive Systems, Vienna, Austria, 2012.
- Toman M (2015): Unsupervised and phonologically controlled interpolation of Austrian German language varieties for speech synthesis. Speech Communication, pp. 176-193. DOI: 10.1016/j.specom.2015.06.005.
- Pucher M (no year given): Proceedings Abstract Book. Other.