Paper information - Toshiba English text-to-speech synthesizer (TESS)

The Toshiba English Text-to-Speech Synthesizer (TESS) uses several new techniques to produce synthesized speech that is more natural-sounding and intelligible than that of conventional synthesizers. The closed-loop training method analytically solves an equation that minimizes the distortion between target units and training data, yielding synthesis units that most closely resemble the training data and are least susceptible to prosodic distortion noise. The pitch contour model creates a codebook of representative word-based F0 contours by first clustering the training data by word stress and number of syllables. Within each cluster, the training data are divided into groups according to the lexical and phonological attributes of each word. For each group, a representative contour is created using approximate error estimation, and the resulting approximate errors are then used to predict the offset level of each contour. These techniques significantly improve the prosodic quality, naturalness, and intelligibility of the synthesized speech.
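
As a rough illustration of the first clustering step in the pitch contour model, the following minimal Python sketch groups word-level F0 contours by word stress and number of syllables and stores one representative contour per cluster. It is not the authors' implementation. The function names, the tuple layout, the toy F0 values, and the assumption that every contour has already been resampled to a fixed length are all hypothetical. The sketch also substitutes a plain pointwise mean (the contour that minimizes summed squared error within a cluster) for the approximate error estimation described above, and it omits both the further grouping by lexical and phonological attributes and the offset level prediction.

```python
from collections import defaultdict


def build_contour_codebook(words):
    """Cluster word-level F0 contours and return one mean contour per cluster.

    `words` is an iterable of (stress, n_syllables, contour) tuples
    (a hypothetical layout). Each `contour` is a list of F0 values in Hz,
    assumed to be resampled to the same length within a cluster.
    """
    clusters = defaultdict(list)
    for stress, n_syllables, contour in words:
        # Cluster key: stressed-syllable position and syllable count.
        clusters[(stress, n_syllables)].append(contour)

    codebook = {}
    for key, contours in clusters.items():
        length = len(contours[0])
        if any(len(c) != length for c in contours):
            raise ValueError(f"contours in cluster {key} differ in length")
        # The pointwise mean minimizes the summed squared error to the
        # cluster members; it stands in for the paper's estimation step.
        codebook[key] = [
            sum(c[i] for c in contours) / len(contours) for i in range(length)
        ]
    return codebook


def squared_error(contour, representative):
    """Summed squared difference between a contour and a representative."""
    return sum((a - b) ** 2 for a, b in zip(contour, representative))


# Toy data, invented for illustration only.
toy_words = [
    (1, 2, [120.0, 140.0, 110.0, 100.0]),
    (1, 2, [130.0, 150.0, 120.0, 100.0]),
    (2, 3, [100.0, 110.0, 150.0, 120.0]),
]
codebook = build_contour_codebook(toy_words)
```

At synthesis time, a word's stress position and syllable count would select an entry such as `codebook[(1, 2)]`, and `squared_error` gives a simple measure of how well that entry fits a given training contour.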

Takehiko Kagoshima | Masami Akamine | Masahiro Morita | Shigenobu Seto | Chang K. Suh
