Paper information - Perceptually based automatic prosody labeling and prosodically enriched unit selection improve concatenative text-to-speech synthesis

Prosody is an important factor in the quality of text-to-speech (TTS) synthesis. Typically, acoustic parameters such as f0 and duration are the only variables related to prosody that are used to determine unit selection. Our study explored adding the explicit use of linguistically and perceptually motivated prosodic categories in unit selection-based TTS. One of our goals was to automate the process of prosodically labeling our TTS inventory. However, reliability among labelers for some ToBI[6] (Tones and Break Indices) categories was too low[9] for successful training of an automatic prosody recognizer. We developed a prosody labeling system simpler and more robust than standard EToBI (English ToBI). This "ToBI Lite" system was used successfully for automatic labeling of the acoustic inventory and in prosodically enriched unit selection. A formal listening test was conducted to compare subjective quality ratings for several variations of the AT&T unit selection concatenative TTS system that differed only in their method of prosodic labeling of the inventory or their use of prosody for unit selection. The use of simple prosodic categories in unit selection significantly improved ratings, and automatic prosodic labeling resulted in higher ratings than manual labeling.

Ann K. Syrdal | Colin W. Wightman | Georg Stemmer | Alistair Conkie | Marc C. Beutnagel

[1] Gayle M. Ayers. Nuclear Accent Types and Prominence: Some Psycholinguistic Experiments, 1996.

[2] Ann K. Syrdal, et al. Inter-transcriber reliability of ToBI prosodic labeling, 2000, INTERSPEECH.

[3] Mari Ostendorf, et al. Automatic labeling of prosodic patterns, 1994, IEEE Trans. Speech Audio Process.

[4] Angelien Sanderman, et al. On the perceptual strength of prosodic boundaries and its relation to suprasegmental cues, 1994.

[5] Eric Moulines, et al. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones, 1989, Speech Commun.

[6] G. Fant, et al. Speech, Music and Hearing Quarterly Progress and Status Report Preliminaries to the study of Swedish prose reading and reading style, 2007.

[7] Barbara Heuft, et al. Towards a prominence-based synthesis system, 1997, Speech Commun.

[8] Thierry Dutoit, et al. Diphone concatenation using a harmonic plus noise model of speech, 1997, EUROSPEECH.

[9] Julia Hirschberg, et al. Automatic ToBI prediction and alignment to speed manual labeling of prosody, 2001, Speech Commun.