Prosodic Modeling in Text-to-Speech Synthesis

This paper discusses three broad obstacles that must be overcome to improve prosodic quality in text-to-speech systems. The first is the set of direct and indirect limits imposed by the signal processing (“synthesis”) components. The second is the set of combinatorial and statistical constraints inherent in generalizing from training corpora to unrestricted domains; overcoming these constraints requires the integration of content-specific knowledge and detailed mathematical modeling. The third is the nature of many empirical research issues that must be solved for prosodic modeling to improve: they are often too focused and model-dependent for academe, and too long-term for development organizations.
