Whistler: a trainable text-to-speech system

We introduce Whistler, a trainable text to speech (TTS) system that automatically learns the model parameters from a corpus. Both prosody parameters and concatenative speech units are derived through the use of probabilistic learning methods that have been successfully used for speech recognition. Whistler can produce synthetic speech that sounds very natural and resembles the acoustic and prosodic characteristics of the original speaker. The underlying technologies used in Whistler can significantly facilitate the process of creating generic TTS systems for a new language, a new voice, or a new speech style.

[1]  Kenneth N. Ross Modeling of intonation for speech synthesis , 1995 .

[2]  D H Klatt,et al.  Review of text-to-speech conversion for English. , 1987, The Journal of the Acoustical Society of America.

[3]  Yoshinori Sagisaka,et al.  ATR μ-talk speech synthesis system , 1992, ICSLP.

[4]  David B. Pisoni,et al.  Text-to-speech: the mitalk system , 1987 .

[5]  Gérard Bailly,et al.  Talking Machines: Theories, Models, and Designs , 1992 .

[6]  Julia Hirschberg,et al.  Pitch Accent in Context: Predicting Intonational Prominence from Text , 1993, Artif. Intell..

[7]  Karen Jensen,et al.  Natural Language Processing: The PLNLP Approach , 2013, Natural Language Processing.

[8]  Mei-Yuh Hwang,et al.  Microsoft Windows highly intelligent speech recognizer: Whisper , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[9]  Philip C. Woodland,et al.  Improvements in an HMM-based speech synthesiser , 1995, EUROSPEECH.

[10]  Mei-Yuh Hwang,et al.  Predicting unseen triphones with senones , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[11]  S. Nakajima,et al.  Automatic generation of synthesis units based on context oriented clustering , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[12]  David Yarowsky,et al.  A corpus-based synthesizer , 1992, ICSLP.