Cross-layer pronunciation modeling for conversational speech
Disciplines
Computer Sciences (40%); Linguistics and Literature (60%)
Keywords
Automatic Speech Recognition, Spontaneous Speech, Pronunciation Variation, Austrian German, Linguistic Models, Dutch
Automatic speech recognition (ASR) systems were originally designed to cope with carefully pronounced speech. As a consequence, they cannot deal well with spontaneous, conversational speech. Read and conversational speech differ in many respects. On the linguistic level, conversational speech contains disfluencies and many utterances that might be considered 'ungrammatical'. On the phonetic level, a much higher degree of pronunciation variation is observed in spontaneous than in read speech. Words are more often acoustically reduced compared to their full pronunciations, such that a word like yesterday may sound like yeshay or a German word like haben may sound like ham. Since most real-world applications of ASR systems require the recognition of spontaneous speech (e.g., dialogue systems, voice input aids for people with physical disabilities, medical dictation systems), the investigation of new methods to model everyday speech has received a lot of attention among speech technologists. In linguistics and psycholinguistics, too, casual conversations are studied in search of an answer to how everyday speech production and comprehension work. These studies have indicated that certain higher-level linguistic functions and structures of utterances condition the details of their pronunciation. It is likely that the kind of analysis that is becoming feasible with the growing availability of large speech corpora will bring to light as yet unknown factors that affect pronunciation variation. The research envisioned in this proposal is designed to increase our knowledge about spontaneous, conversational speech and to use this knowledge to improve ASR systems. The first objective is to identify, by means of quantitative phonetic analyses, which higher-level linguistic structures and functions condition pronunciation variation.
Studies will be carried out on Dutch and on Austrian German material, which will allow us to draw conclusions about which findings are language-specific and which are characteristic of conversational speech in general. The second objective is to improve ASR technology by incorporating the knowledge gained about the conditions for pronunciation variation. Most ASR systems still deal with acoustic and linguistic information independently of each other. In contrast, I propose a cross-layer pronunciation modeling technique, in which the recognizer (1) makes use of the knowledge gained about the effects of several layers of linguistic structures and functions on pronunciation variation, and (2) makes use of lexicons in more than one layer of its architecture. Additional deliverables of this project are the collected speech material and the tools created for its automatic annotation, both of which would be of great value for future studies by linguists and engineers.
The Problem
Automatic speech recognition (ASR) systems were originally designed to cope with carefully pronounced speech. Most real-world applications of ASR systems, however, require the recognition of spontaneous, conversational speech (e.g., dialogue systems, voice input aids for people with physical disabilities, medical dictation systems). Compared to prepared or read speech, conversational speech contains utterances that might be considered 'ungrammatical' and that contain disfluencies, such as "...oh, well, I think ahhm exactly". The pronunciation of words may depend, for instance, on the regional background of the speakers, the formality of the situation, or the frequency of the word. A highly frequent word like yesterday may sound like yeshay, and the German word haben (to have) may sound like ham. This project focused on investigating interdisciplinary methods (drawing on linguistics, phonetics, and speech technology) to model the factors on which pronunciation variation depends in everyday speech.
The Methods
In this project, we collected and annotated the first large-scale speech database of Austrian German. It is a rich resource on pronunciation variation in Austrian German, containing approximately 1900 minutes of speech spoken by 38 speakers from 5 provinces in 3 different speaking styles (read speech, spontaneous commands, and conversational speech). Moreover, it is one of the largest German speech databases with completely unconstrained and casual conversations, and is thus also relevant to speech scientists outside Austria. We have also developed transcription tools for the corpus and have made both the speech material and the tools available to other researchers.
The Findings
Based on Dutch, German, and the collected Austrian German speech material, we found that pronunciation variation depends not only on well-known factors such as the regional background of the speaker and the speaking style, but also on, for example, the grammatical and morphological properties of the words. For instance, in spontaneous speech the German word der is pronounced differently depending on whether it is an article, a demonstrative pronoun, or a relative pronoun, whereas in read speech it is always pronounced the same way. These linguistic findings on pronunciation variation were used to develop methods to improve ASR systems. Most importantly, our work not only demonstrates novel methods for ASR but also introduces a new perspective: whereas the high degree of pronunciation variation in spontaneous speech was previously seen primarily as a problem for ASR, we view it as an additional resource that is not present in read speech. This change in perspective will guide our future research plans.
- Technische Universität Graz - 100%
- Mirjam Ernestus, Radboud University - Netherlands
Research Output
- 40 Citations
- 13 Publications
- 2017. Title: Rethinking classification results based on read speech, or: why improvements do not always transfer to other speaking styles; DOI: 10.1007/s10772-017-9436-y; Type: Journal Article; Author: Schuppler B; Journal: International Journal of Speech Technology; Pages: 699-713
- 2017. Title: A corpus of read and conversational Austrian German; DOI: 10.1016/j.specom.2017.09.003; Type: Journal Article; Author: Schuppler B; Journal: Speech Communication; Pages: 62-74
- 2017. Title: Acoustic correlates of stress and accent in Standard Austrian German; Type: Book Chapter; Author: El Zarka D
- 2013. Title: Informal speech processes can be categorical in nature, even if they affect many different words; DOI: 10.1121/1.4790352; Type: Journal Article; Author: Hanique I; Journal: The Journal of the Acoustical Society of America; Pages: 1644-1655
- 2018. Title: On the use of acoustic features for automatic disambiguation of homophones in spontaneous German; DOI: 10.1016/j.csl.2017.12.011; Type: Journal Article; Author: Schuppler B; Journal: Computer Speech & Language; Pages: 209-224
- 2014. Title: Pronunciation Variation in Read and Conversational Austrian German; Type: Conference Proceeding Abstract; Author: Morales-Cordovilla JA et al.; Conference: Proceedings of Interspeech
- 2014. Title: How extra-linguistic factors affect pronunciation variation in different speaking styles; Type: Conference Proceeding Abstract; Author: Schuppler B; Conference: 22nd Czech-German Workshop on Speech Communication
- 2014. Title: GRASS: The Graz Corpus of Read and Spontaneous Speech; Type: Conference Proceeding Abstract; Author: Pessentheiner H et al.; Conference: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
- 2013. Title: The challenge of manner classification in conversational speech; Type: Conference Proceeding Abstract; Author: Boves L et al.; Conference: Proceedings of the Workshop on Speech Production in Automatic Speech Recognition, Satellite Workshop of Interspeech
- 2015. Title: Automatic detection of uncertainty in spontaneous German dialogue; Type: Conference Proceeding Abstract; Author: Schrank T; Conference: Proceedings of Interspeech
- 2014. Title: Statistical Language and Speech Processing, Second International Conference, SLSP 2014, Grenoble, France, October 14-16, 2014, Proceedings; DOI: 10.1007/978-3-319-11397-5; Type: Book; Publisher: Springer Nature
- 2014. Title: Automatic Phonetic Transcription in Two Steps: Forced Alignment and Burst Detection; DOI: 10.1007/978-3-319-11397-5_10; Type: Book Chapter; Author: Schuppler B; Publisher: Springer Nature; Pages: 132-143
- 2014. Title: Where /aR/ the /R/s in Standard Austrian German?; Type: Conference Proceeding Abstract; Author: Jackschina A; Conference: Proceedings of Interspeech