FWF — Austrian Science Fund

Adaptive Audio-Visual Dialect Speech Synthesis

Michael Pucher (ORCID: 0000-0002-5374-1342)
  • Grant DOI 10.55776/P22890
  • Funding program Principal Investigator Projects
  • Status ended
  • Start January 1, 2011
  • End September 30, 2014
  • Funding amount € 299,526

Disciplines

Computer Sciences (85%); Linguistics and Literature (15%)

Keywords

    Speech Synthesis, Visual Synthesis, Dialect

Abstract

The goal of this project is to investigate multimodal adaptation for audio-visual speech synthesis. Human speech is multimodal, and we therefore aim to model the audio and visual signals jointly. In speech behavior we are confronted with intra-speaker variability (e.g., variability depending on the speech situation, speaking task, or emotional state of the speaker) and inter-speaker variability (e.g., variability across sociolects and/or dialects). The second type of variation can be modeled by adapting average models of speakers with different dialects to a speaker of a specific dialect. Dialect is chosen as the source of variation between speakers in order to extend our previous work on Viennese sociolects to other Austrian dialects and to conduct basic research on the audio-visual synthesis of dialects. Generally, audio-visual speech synthesis is the attempt to generate both the speech and the visual signals of a person speaking. In most previous approaches the acoustic and visual signals were modeled separately, although both signals are the result of the same underlying articulation process and should be treated as one. Moreover, adding visual information to the synthesis models might lead to better overall acoustic synthesis. We therefore propose a joint audio-visual modeling framework that is able to generate both acoustic and visual speech for different Austrian dialects. By employing hidden Markov models (HMMs) for both audio and visual speech synthesis, the two feature streams can be combined into a single model. A major aspect of this project will thus be the multimodal adaptation of audio-visual synthesis models. The joint adaptation of audio and visual models from multimodal audio-visual models has not yet been investigated and leads to several important research questions that we want to address in this project.
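The abstract describes combining the acoustic and visual feature streams in a single HMM-based model. As a rough illustration only, not the project's actual implementation (which, according to the publications listed below, was based on hidden semi-Markov models with speaker adaptation), the following Python sketch scores one joint audio-visual observation with a single multi-stream state; the stream names, feature dimensions, and stream weights are assumptions chosen for the example.

# Illustrative sketch only: a multi-stream HMM state emission that scores
# acoustic and visual feature vectors jointly, in the spirit of combining
# both streams in a single model. All names and dimensions are assumptions.
import numpy as np


class MultiStreamState:
    """One HMM state with a separate diagonal Gaussian per stream and stream weights."""

    def __init__(self, means, variances, weights):
        # means/variances: dict stream name -> 1-D numpy arrays (diagonal covariance)
        self.means = means
        self.variances = variances
        self.weights = weights  # e.g. {"acoustic": 1.0, "visual": 1.0}

    def log_likelihood(self, observation):
        """Weighted sum of per-stream diagonal-Gaussian log-densities."""
        total = 0.0
        for stream, x in observation.items():
            mu = self.means[stream]
            var = self.variances[stream]
            ll = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)
            total += self.weights[stream] * ll
        return total


# Example: a 25-dim acoustic stream (e.g. mel-cepstra) and a 9-dim visual
# stream (e.g. facial-marker PCA coefficients) scored by a single state.
rng = np.random.default_rng(0)
state = MultiStreamState(
    means={"acoustic": np.zeros(25), "visual": np.zeros(9)},
    variances={"acoustic": np.ones(25), "visual": np.ones(9)},
    weights={"acoustic": 1.0, "visual": 1.0},
)
obs = {"acoustic": rng.normal(size=25), "visual": rng.normal(size=9)}
print(state.log_likelihood(obs))

In a full synthesis system, such per-stream weights would let the acoustic and visual streams contribute differently during training and generation; the sketch only shows how one state can score both modalities at once.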

Final report

Generally, audio-visual speech synthesis is the attempt to generate both the speech and the visual signals of a person speaking. Audio-visual synthesis can be used in communication technologies and computer games. In this project we investigated multimodal modeling for audio-visual dialect speech synthesis. Human speech is multimodal, and we therefore modeled the audio and visual signals jointly. In most previous approaches the acoustic and visual signals were modeled separately, although both signals are the result of the same underlying articulation process and should be treated as one.

In this project we were able to show that joint modeling of visual and acoustic signals can lead to better visual synthesis without changing the quality of the acoustic synthesis. Because the models are flexible and can be adapted through parameters, they can easily be reused and transformed.

Furthermore, we were able to show that adapting visual average models with new data can improve modeling compared to models that do not use any background data. With this method it is possible to train a visual model of a person from a small amount of adaptation data.

For controlling acoustic models with a large number of parameters, we developed a method that allows acoustic parameters to be changed via visual parameters. Opening the mouth in the visual model can thus lead to the corresponding acoustic changes in the acoustic model.

For the modeling of dialects, we made extensive recordings of two Austrian dialects with eight speakers: one Middle Bavarian dialect from Upper Austria (Bad Goisern) and one South Bavarian dialect from Tyrol (Innervillgraten). For these audio-visual dialect recordings for speech synthesis, we developed a method for phonetic data collection and audio-visual recording. We also developed methods for the optimal use of dialect data. The recorded data is already being used in other ongoing projects and will also lead to new findings in the future.
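The adaptation of visual average models mentioned in the report can be pictured as estimating a transform from an average ("background") model to a target speaker from little data. The sketch below is a minimal, self-contained illustration in that spirit, assuming a single global mean-only linear-regression transform (in the style of MLLR) and purely synthetic data; it is not the project's code, and all dimensions and variable names are invented for the example.

# Minimal sketch, assuming synthetic data: mean-only, global linear-regression
# adaptation of a visual "average voice" model to a target speaker using only
# a small amount of adaptation data.
import numpy as np

rng = np.random.default_rng(1)
dim, n_states = 9, 40                      # e.g. 9 visual PCA coefficients

# Average-model state means (the background model trained on many speakers).
avg_means = rng.normal(size=(n_states, dim))

# Pretend the target speaker differs from the average by an affine transform,
# and we observe one noisy adaptation frame per state (a "small" data set).
true_A = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
true_b = 0.5 * rng.normal(size=dim)
target_frames = avg_means @ true_A.T + true_b + 0.05 * rng.normal(size=(n_states, dim))

# Estimate a single global affine transform [A | b] by least squares over the
# (average mean, adaptation frame) pairs.
X = np.hstack([avg_means, np.ones((n_states, 1))])      # extended means
W, *_ = np.linalg.lstsq(X, target_frames, rcond=None)   # shape (dim + 1, dim)
adapted_means = X @ W                                    # A @ mu + b for every state

print("mean abs error before adaptation:",
      np.abs(avg_means - target_frames).mean().round(3))
print("mean abs error after adaptation: ",
      np.abs(adapted_means - target_frames).mean().round(3))

With more adaptation data, such a single global transform could be replaced by several regression-class-specific transforms; the point of the sketch is only that a small amount of target-speaker data can pull an average model toward that speaker.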

Research institution(s)
  • FTW Forschungszentrum Telekommunikation - 89%
  • Österreichische Akademie der Wissenschaften - 11%
Project participants
  • Sylvia Moosmüller, Österreichische Akademie der Wissenschaften, associated research partner

Research Output

  • 233 Citations
  • 14 Publications
Publications
  • 2011
    Title Phone set selection for HMM-based dialect speech synthesis.
    Type Conference Proceeding Abstract
    Author Pucher M
  • 2014
    Title The MMASCS multi-modal annotated synchronous corpus of audio, video, facial motion and tongue motion data of normal, fast and slow speech.
    Type Conference Proceeding Abstract
    Author Hoole P et al.
    Conference LREC 2014
  • 2013
    Title Joint Audiovisual Hidden Semi-Markov Model-Based Speech Synthesis
    DOI 10.1109/jstsp.2013.2281036
    Type Journal Article
    Author Schabus D
    Journal IEEE Journal of Selected Topics in Signal Processing
    Pages 336-347
    Link Publication
  • 2013
    Title Objective and Subjective Feature Evaluation for Speaker-Adaptive Visual Speech Synthesis.
    Type Conference Proceeding Abstract
    Author Hofer G et al.
    Conference AVSP 2013
  • 2013
    Title Visual Control of Hidden-Semi-Markov-Model based Acoustic Speech Synthesis.
    Type Conference Proceeding Abstract
    Author Hollenstein J
    Conference AVSP 2013
  • 2012
    Title From Viennese to Austrian German and back again: An algorithm for the realization of a variety-slider.
    Type Conference Proceeding Abstract
    Author Hofer G et al.
    Conference SIDG 2012
  • 2012
    Title Building a synchronous corpus of acoustic and 3D facial marker data for adaptive audiovisual speech synthesis.
    Type Conference Proceeding Abstract
    Author Hofer G et al.
    Conference LREC 2012
  • 2012
    Title Sprachressourcen für adaptive Sprachsynthesen von Dialekten [Speech resources for adaptive speech synthesis of dialects].
    Type Conference Proceeding Abstract
    Author Hofer G et al.
    Conference SIDG 2012
  • 2012
    Title Evaluation of Speaker Verification Security and Detection of HMM-Based Synthetic Speech
    DOI 10.1109/tasl.2012.2201472
    Type Journal Article
    Author De Leon P
    Journal IEEE Transactions on Audio, Speech, and Language Processing
    Pages 2280-2290
    Link Publication
  • 2011
    Title Detection of Synthetic Speech for the Problem of Imposture
    DOI 10.1109/icassp.2011.5947440
    Type Conference Proceeding Abstract
    Author De Leon P
    Pages 4844-4847
    Link Publication
  • 2012
    Title Speaker-adaptive visual speech synthesis in the HMM-Framework.
    Type Conference Proceeding Abstract
    Author Hofer G et al.
  • 2012
    Title Regionalizing Virtual Avatars - Towards Adaptive Audio-Visual Dialect Speech Synthesis.
    Type Conference Proceeding Abstract
    Author Moosmüller S et al.
    Conference 5th International Conference on Cognitive Systems, Vienna, Austria, 2012
  • 2015
    Title Unsupervised and phonologically controlled interpolation of Austrian German language varieties for speech synthesis
    DOI 10.1016/j.specom.2015.06.005
    Type Journal Article
    Author Toman M
    Journal Speech Communication
    Pages 176-193
    Link Publication
  • 0
    Title Proceedings Abstract Book.
    Type Other
    Author Pucher M
