A hidden Markov-model-based trainable speech synthesizer

This paper presents a new approach to speech synthesis in which a set of cross-word, decision-tree state-clustered, context-dependent hidden Markov models is used to define a set of subphone units for use in a concatenation synthesizer. The models, trees, waveform segments and other parameters representing each clustered state are obtained completely automatically through training on a 1-hour single-speaker continuous-speech database. During synthesis, the required utterance, specified as a string of words of known phonetic pronunciation, is generated as a sequence of these clustered states using a TD-PSOLA waveform concatenation synthesizer. The system produces speech which, though monotone, is both natural sounding and highly intelligible. A Modified Rhyme Test conducted to measure segmental intelligibility yielded a 5.0% error rate. The speech produced by the system mimics the voice of the speaker used to record the training database. The system can be retrained on a new voice in less than 48 hours, and has been successfully trained on four voices.
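As a rough illustration of the synthesis path described above (phone string, then context-dependent states, then clustered states, then concatenated waveform segments), the following Python sketch may help. It is a toy under stated assumptions, not the system's implementation: the three-states-per-phone layout, the single "nasal" question, the `sil` padding symbol, and every name in the code (`find_cluster`, `state_sequence`, `toy_tree`, and so on) are hypothetical, and plain sample concatenation stands in for TD-PSOLA.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

# (left phone, centre phone, right phone); hypothetical representation.
Context = Tuple[str, str, str]


@dataclass
class Leaf:
    cluster_id: str  # identifier of one clustered state


@dataclass
class Node:
    question: Callable[[Context], bool]  # yes/no question about the context
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def find_cluster(tree: Union[Node, Leaf], ctx: Context) -> str:
    """Walk a decision tree until a leaf (clustered state) is reached."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(ctx) else tree.no
    return tree.cluster_id


def triphone_contexts(phones: List[str]) -> List[Context]:
    """Contexts over the whole utterance, so they span word boundaries."""
    padded = ["sil"] + phones + ["sil"]  # assumed silence padding
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]


def state_sequence(phones: List[str],
                   trees: Dict[Tuple[str, int], Union[Node, Leaf]],
                   states_per_phone: int = 3) -> List[str]:
    """Map a phone string to a sequence of clustered-state identifiers."""
    seq = []
    for ctx in triphone_contexts(phones):
        for s in range(states_per_phone):
            seq.append(find_cluster(trees[(ctx[1], s)], ctx))
    return seq


def concatenate(cluster_ids: List[str],
                segments: Dict[str, List[float]]) -> List[float]:
    """Join the stored segment of each state (stand-in for TD-PSOLA)."""
    out: List[float] = []
    for cid in cluster_ids:
        out.extend(segments[cid])
    return out


NASALS = {"m", "n", "ng"}


def toy_tree(phone: str, state: int) -> Node:
    """One-question tree: is the right-hand neighbour a nasal?"""
    return Node(question=lambda ctx: ctx[2] in NASALS,
                yes=Leaf(f"{phone}_s{state}_nasal"),
                no=Leaf(f"{phone}_s{state}_other"))


phones = ["k", "ae", "n"]
trees = {(p, s): toy_tree(p, s) for p in set(phones) for s in range(3)}
ids = state_sequence(phones, trees)
segments = {cid: [0.0] * 4 for cid in set(ids)}  # dummy 4-sample segments
waveform = concatenate(ids, segments)
```

In this toy, a real system would replace the hand-written tree with trees grown automatically during training, and the dummy segments with waveform segments selected for each clustered state.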
