A Corpus Based Investigation into Segmental Duration in German Speech
Disciplines
Computer Sciences (85%); Linguistics and Literature (15%)
Keywords
SPEECH SYNTHESIS,
PROSODY,
COMPUTATIONAL LINGUISTICS,
LANGUAGE ENGINEERING,
ARTIFICIAL INTELLIGENCE
Automatic speech synthesis is a highly promising field of growing economic importance. Natural-sounding speech is a key factor in the acceptability of practical voice output systems, and the main contributors to naturalness are segmental quality and prosody. Recent improvements in the segmental quality of synthesized speech have made it clear that truly high-quality speech synthesis now depends crucially on adequate and natural-sounding prosody as well. In addition, new application areas that go beyond text-to-speech, such as spoken dialogue and concept-to-speech systems, call for the production of utterances with non-neutral prosody. We propose a project for the investigation of segmental duration in German speech. At the moment, most research in prosody is directed at intonation and its realization through fundamental frequency (f0), which is taken to be the single most important factor in prosody, while duration (and amplitude) are regarded as secondary (dependent) parameters. A more thorough investigation of duration and its interaction with f0 is necessary before such a conclusion can be drawn, if it can be drawn at all. The goal of the project is a better model of segmental duration and of its interaction with other parameters such as f0, an area where the currently prevalent methods do not satisfy users in terms of naturalness. Moreover, we want to gain a clearer understanding of the relation between discourse structure and prosodic parameters. New application areas make rich linguistic information available to speech production for the first time. To take advantage of this information, the realization of these features by prosodic parameters has to be understood. Central to our approach will be the investigation of the interdependencies between intonation and duration, i.e., fundamental frequency will be taken into account explicitly by means of tone labeling.
Another novelty is the explicit incorporation of discourse-related information, such as the division into topic, focus and background, in a dedicated part of the corpus. When investigating prosody we have to take into account a large number of (potential) parameters without being able to resort to an agreed-upon linguistic theory covering the whole range of phenomena. Moreover, we cannot exclude a priori non-linear dependencies between these parameters. In such a situation a data-driven, statistical approach seems appropriate. To investigate segmental duration in that paradigm we need speech corpora of adequate size with prosodic labeling. Because of the number of influencing factors to be considered, we will need a corpus of considerable size (50,000+ phonemes). The corpus used in our study will be the first corpus of that kind for Austrian German. The construction of such a speech database is not only an indispensable prerequisite for our study on duration but will also be of interest to other researchers who want to perform phonetic investigations on that variety of German. A demand for such a corpus exists in both academic and industrial research. The statistical methods selected must be able to cope with the inherently uneven distribution of feature values (data sparsity). Some forms of neural networks have proved suitable for the task of predicting segmental duration. Their disadvantage for our purpose is that the results they achieve are inherently difficult to interpret. Therefore, we have opted for the use of machine learning methods, in particular Structural Regression Trees (SRT). SRT integrates the statistical method of regression trees with the inductive logic programming paradigm. It is a flexible machine learning paradigm that allows for the use of relational constraints and is well suited to numerical problems. It also fulfills the requirement of producing inspectable results.
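To make the idea of an inspectable tree-based duration model more concrete, the following is a minimal sketch of an ordinary regression tree over categorical context features. It is not SRT: the relational, inductive-logic-programming side is left out entirely. The feature names (`phone`, `accented`, `phrase_final`), the helper functions, and the toy durations are all invented for illustration and are not taken from the project corpus.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, features, min_leaf):
    """Find the equality test (feature, value) that most reduces the SSE."""
    best, best_cost = None, sse([r["dur"] for r in rows])
    for f in features:
        for val in sorted({r[f] for r in rows}):
            yes = [r["dur"] for r in rows if r[f] == val]
            no = [r["dur"] for r in rows if r[f] != val]
            if len(yes) < min_leaf or len(no) < min_leaf:
                continue  # a split must leave enough data on both sides
            cost = sse(yes) + sse(no)
            if cost < best_cost - 1e-9:  # require a strict improvement
                best, best_cost = (f, val), cost
    return best

def grow(rows, features, min_leaf=2):
    """Recursively grow a tree; each leaf stores the mean duration in ms."""
    split = best_split(rows, features, min_leaf) if len(rows) >= 2 * min_leaf else None
    if split is None:
        return {"leaf": mean(r["dur"] for r in rows), "n": len(rows)}
    f, val = split
    return {
        "test": (f, val),
        "yes": grow([r for r in rows if r[f] == val], features, min_leaf),
        "no": grow([r for r in rows if r[f] != val], features, min_leaf),
    }

def predict(tree, row):
    """Follow the tests from the root down to a leaf and return its mean."""
    while "leaf" not in tree:
        f, val = tree["test"]
        tree = tree["yes"] if row[f] == val else tree["no"]
    return tree["leaf"]

def show(tree, indent=""):
    """Print the tree as nested rules so the model can be inspected."""
    if "leaf" in tree:
        print(f"{indent}-> {tree['leaf']:.1f} ms (n={tree['n']})")
        return
    f, val = tree["test"]
    print(f"{indent}if {f} == {val!r}:")
    show(tree["yes"], indent + "    ")
    print(f"{indent}else:")
    show(tree["no"], indent + "    ")

# Invented toy data: one row per segment, duration in milliseconds.
FEATURES = ["phone", "accented", "phrase_final"]
TOY = [
    {"phone": "a", "accented": True,  "phrase_final": False, "dur": 90.0},
    {"phone": "a", "accented": False, "phrase_final": False, "dur": 60.0},
    {"phone": "a", "accented": True,  "phrase_final": True,  "dur": 140.0},
    {"phone": "a", "accented": False, "phrase_final": True,  "dur": 120.0},
    {"phone": "t", "accented": True,  "phrase_final": False, "dur": 50.0},
    {"phone": "t", "accented": False, "phrase_final": False, "dur": 40.0},
    {"phone": "t", "accented": True,  "phrase_final": True,  "dur": 80.0},
    {"phone": "t", "accented": False, "phrase_final": True,  "dur": 70.0},
]

tree = grow(TOY, FEATURES)
show(tree)
```

The `min_leaf` parameter is one simple way to react to data sparsity: a context is only split further while enough observations remain on both sides, so rare feature combinations fall back to the mean of a broader context.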
The results of our study shall be integrated into our existing speech synthesis component. This will provide us with the necessary tool to test our hypotheses experimentally in the evaluation phase. It shall also demonstrate the practical improvement in the quality of synthesized speech achieved by employing the project's results. The proposed project shall also form the Austrian contribution to COST action 258 "Naturalness of synthetic Speech". This action comprises research laboratories from 14 European countries. The aim of the action is to develop methods to increase the naturalness of synthetic speech, which is a prerequisite for its broad use in commercial applications.
Naturalness is the key factor for the acceptance and comprehensibility of automatically synthesised speech. One of the most important parameters to control is the duration of speech segments. To be able to predict the duration of speech sounds, it is necessary to analyse recordings of actual speech by means of statistical methods. For this purpose, an adequately large corpus of Austrian German was established. We recorded one speaker, then segmented and annotated the speech signal. Using machine learning techniques, it was possible to obtain adequate duration models. Their quality was checked against methods in the literature. For the first time, we now have a model for the automatic synthesis of the Austrian variety of German. The following factors have to be controlled in order to synthesise speech in an unrestricted and natural way: intensity, pitch, and, perhaps most important, the duration of particular acoustic events. This holds independently of the method used to generate the speech signal, be it the simulation of the characteristics of the sound (formant synthesis), the derivation from production models (articulatory synthesis), or the concatenation of pre-recorded parts of speech (concatenative synthesis). The core problem in modelling duration is that the speech signal functions as a carrier for a variety of information that can only be communicated together. The speaker must combine this information, and the hearer must extract the individual components from the complex signal. Non-linguistic information includes, for example, the emotional state of the speaker. Further influencing factors are speaker characteristics or a certain speaking style. Additionally, the make-up of the utterance in terms of phrases is encoded, as well as accenting. Language-specific factors come from the linguistic structuring, from the sentence level (syntax) down to the level of phonemes (syllable structure).
Besides that, there are also genuinely phonetic factors, such as the mutual influence of neighbouring phonemes, which affect the duration of single segments. How can we approach this complex task? Either one postulates a set of rules that yield a duration value for each phoneme, or one tries to model natural speech using statistical methods. In this project the second approach was favoured. To this end, it was necessary to establish a corpus of spoken language that is large enough and combinatorially rich enough for machine learning techniques to provide valid results. Potential influencing factors either had to be controlled (we recorded read speech from a single speaker of Austrian Standard German from Vienna) or, if they were taken to have some influence on duration, had to be determined (for example phrasing, accent, syllable structure, neighbouring segments). On top of that, it was necessary to segment the signal into individual sounds in order to obtain reference values for duration. The corpus contains approx. 50,000 phonetic segments, all of which have at least been corrected manually. In a last step, the data were used to generate various models using statistical machine learning techniques. These models predict a duration value for each phoneme in every potential context. For optimisation we experimented with various factors and also tested several techniques. The quality of the results is as good as that of the best methods reported in the literature. Perhaps the most significant result of this project is that, for the first time, a model is available for the synthesis of Austrian German.
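The text does not say which measures were used to compare the duration models with the literature. As an assumption, the sketch below shows two measures that are commonly reported for duration prediction: the root mean squared error (in ms) and the Pearson correlation between observed and predicted segment durations. The function names and the numbers are invented for illustration.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between two equally long duration lists (ms)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def pearson(observed, predicted):
    """Pearson correlation; returns 0.0 if either list has zero variance."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    so = math.sqrt(sum((o - mo) ** 2 for o in observed))
    sp = math.sqrt(sum((p - mp) ** 2 for p in predicted))
    if so == 0.0 or sp == 0.0:
        return 0.0  # convention chosen here, e.g. for a constant baseline
    return cov / (so * sp)

# Invented example: manually corrected durations vs. two predictors.
observed = [90.0, 60.0, 140.0, 120.0, 50.0, 40.0]
model_pred = [85.0, 65.0, 130.0, 125.0, 55.0, 45.0]   # hypothetical model output
baseline_pred = [sum(observed) / len(observed)] * len(observed)  # global mean

for name, pred in [("model", model_pred), ("mean baseline", baseline_pred)]:
    print(f"{name}: RMSE = {rmse(observed, pred):.1f} ms, r = {pearson(observed, pred):.2f}")
```

Comparing against a trivial baseline such as the global mean duration helps to show how much of the variation the context features actually explain.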
- Gernot Kubin, Technische Universität Graz, associated research partner
- Grzegorz Dogil, Universität Stuttgart-Hohenheim - Germany