Adapting Prosody in a Text-to-Speech System

Evolving information and communication technologies (ICT) place new demands on text-to-speech (TTS) systems. A modern TTS system has to adapt quickly and with high quality to a new language, a new voice or even expressive speech, so adaptation to new voices with different prosodic characteristics is desirable. This chapter surveys past and recent approaches to prosodic processing in text-to-speech synthesis. Although the proposed approaches range from generating prosody by rule to using huge databases that cover almost all prosodic patterns of a specific speaker, there is clearly still much work to be done (van Santen et al., 2008). Automatic learning techniques seem to offer the fastest way of adapting a TTS system to a new language, voice or application. They allow specific features (e.g. non-uniform unit selection, prosodic regularities extraction) to be extracted automatically from an appropriate database of natural speech. Such techniques depend on the construction of large pre-processed corpora (properly segmented, labelled with appropriate prosody labels, etc.). Despite the overall impression that TTS is a lesser task than speech recognition, the TTS research and development community has not been able to produce a mass series of consumer products since the early 1980s (Dutoit, 2008). Since then a broad spectrum of systems has been developed and successfully implemented, and prosody has been one of the major tasks to tackle in such systems. The term “prosody” covers a wide range of features characterizing “the musical qualities” of speech, including phrasing, pitch, loudness, tempo and rhythm. A number of studies suggest that prosody has a great impact on the intelligibility and perceived naturalness of speech.
Although synthesized speech is nowadays mostly intelligible and in some cases sounds indistinguishable from human speech, it still lacks flexibility and an appropriate rendering of expressivity. One of the major problems in TTS synthesis is the automatic generation of natural and intelligible prosody. Text-to-prosody systems based on prosodic databases extracted from natural speech are key to the development of new TTS systems, so the preparation of suitable speech corpora for automatic prosodic feature extraction is essential.
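
As a rough illustration of what automatic prosodic feature extraction from a segmented and labelled corpus can look like, the following minimal sketch computes duration, pitch and energy statistics for each labelled unit. It is not the method of any system surveyed here: the `Segment` and `ProsodicFeatures` records, their field names and the toy values are all assumptions made for the example, and a real corpus would supply frame-level pitch and energy from a signal analysis tool.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Segment:
    """One labelled unit (e.g. a syllable) of a hypothetical corpus."""
    label: str           # phonetic or orthographic label
    start: float         # start time in seconds
    end: float           # end time in seconds
    f0: List[float]      # frame-level pitch in Hz; 0.0 marks unvoiced frames
    energy: List[float]  # frame-level energy in arbitrary units
    boundary: bool       # True if a phrase break is labelled after the unit


@dataclass
class ProsodicFeatures:
    label: str
    duration: float
    mean_f0: Optional[float]   # None when the unit has no voiced frames
    f0_range: Optional[float]  # None when the unit has no voiced frames
    mean_energy: float
    boundary: bool


def extract_features(seg: Segment) -> ProsodicFeatures:
    """Summarise the tempo, pitch and loudness cues of one labelled unit."""
    voiced = [v for v in seg.f0 if v > 0.0]  # ignore unvoiced frames
    return ProsodicFeatures(
        label=seg.label,
        duration=seg.end - seg.start,
        mean_f0=mean(voiced) if voiced else None,
        f0_range=(max(voiced) - min(voiced)) if voiced else None,
        mean_energy=mean(seg.energy) if seg.energy else 0.0,
        boundary=seg.boundary,
    )


# Toy corpus with invented values, for illustration only.
corpus = [
    Segment("pro", 0.00, 0.20, [0.0, 110.0, 120.0, 130.0], [0.5, 0.7, 0.6, 0.6], False),
    Segment("so", 0.20, 0.35, [125.0, 115.0, 0.0], [0.4, 0.5, 0.3], False),
    Segment("dy", 0.35, 0.65, [0.0, 0.0], [0.2, 0.2], True),
]
features = [extract_features(s) for s in corpus]
```

Feature records of this kind, paired with the prosody labels, are the sort of input an automatic learning technique would be trained on; which features are actually useful depends on the language and the corpus.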

[1]  D. H. Klatt, et al.  Review of text-to-speech conversion for English , 1987, The Journal of the Acoustical Society of America.

[2]  Martin Holzapfel.  HMM-based database segmentation and unit selection for concatenative speech synthesis , 1999 .

[3]  Geoffrey E. Hinton,et al.  Learning internal representations by error propagation , 1986 .

[4]  Hans-Georg Zimmermann,et al.  A data-driven method for input feature selection within neural prosody generation , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Justin Fackrell,et al.  Designing prosodic databases for automatic modelling in 6 languages , 1998, SSW.

[6]  Jacob Benesty,et al.  Springer handbook of speech processing , 2007, Springer Handbooks.

[7]  Ralf Kompe,et al.  Prosody in Speech Understanding Systems , 1997, Lecture Notes in Computer Science.

[8]  Barbara Heuft,et al.  Prosody generation with a neural network , 1996 .

[9]  Bogomir Horvat,et al.  Labeling of Symbolic Prosody Breaks for the Slovenian Language , 2003, Int. J. Speech Technol..

[10]  Rüdiger Hoffmann,et al.  Natural F0 contours with a new neural-network-hybrid approach , 2000, INTERSPEECH.

[11]  Christof Traber.  F0 generation with a data base of natural F0 patterns and with a neural network , 1990, SSW.

[12]  Horst-Udo Hain.  Automation of the training procedures for neural networks performing multi-lingual grapheme to phoneme conversion , 1999, EUROSPEECH.

[13]  Horst-Udo Hain,et al.  A multi-lingual system for the determination of phonetic word stress using soft feature selection by neural networks , 2001, SSW.

[14]  Matej Rojc,et al.  Design of Optimal Slovenian Speech Corpus for Use in the Concatenative Speech Synthesis System , 2000, LREC.

[15]  Justin Fackrell,et al.  Automatic prosodic labeling of 6 languages , 1998, ICSLP.

[16]  N. Campbell, et al.  Voice Quality: the 4th Prosodic Dimension , 2004 .

[17]  Barbara Heuft,et al.  Prosody generation with a neural network: weighing the importance of input parameters , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[18]  P. Boersma.  Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound , 1993 .

[19]  Lutz Prechelt,et al.  Early Stopping-But When? , 1996, Neural Networks: Tricks of the Trade.

[20]  Hans-Georg Zimmermann,et al.  Segmental duration control by time delay neural networks with asymmetric causal and retro-causal information flows , 2002, ESANN.

[21]  Halewijn Vereecken,et al.  Improving the phonetic annotation by means of prosodic phrasing , 1997, EUROSPEECH.

[22]  Thierry Dutoit.  Corpus-Based Speech Synthesis , 2008 .

[23]  Ralph Neuneier,et al.  Robust generation of symbolic prosody by a neural classifier based on autoassociators , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[24]  Fabio Tamburini,et al.  Automatic detection of prosodic prominence in continuous speech , 2002, LREC.

[25]  Rüdiger Hoffmann,et al.  Robust unit selection based on syllable prosody parameters , 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis, 2002..

[26]  Bogomir Horvat,et al.  Designing Prosodic Databases for Automatic Modeling of Slovenian Language in a Multilingual TTS System , 2002, LREC.

[27]  Rüdiger Hoffmann,et al.  Data-driven importance analysis of linguistic and phonetic information , 2000, INTERSPEECH.

[28]  Christopher M. Bishop,et al.  Neural networks for pattern recognition , 1995 .

[29]  Vincent J. van Heuven,et al.  Acoustic correlates of linguistic stress and accent in Dutch and American English , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[30]  Ralph Neuneier,et al.  Modeling Dynamical Systems by Error Correction Neural Networks , 2002 .

[31]  J. V. Santen,et al.  The analysis of contextual effects on segmental duration , 1990 .

[32]  Paul Taylor,et al.  Assigning phrase breaks from part-of-speech sequences , 1997, Comput. Speech Lang..

[33]  E. Nöth,et al.  Recognition of Selected Prosodic Events in Slovenian Speech , 2022 .

[34]  Slovenian Lang,et al.  An Environment for Word Prominence Classification in Slovenian Language , 2003 .

[35]  Peter Jackson,et al.  Overview of Current Text-to-Speech Techniques: Part II - Prosody and Speech Generation , 1996 .

[36]  Nick Campbell,et al.  A nonlinear unit selection strategy for concatenative speech synthesis based on syllable level features , 1998, ICSLP.