Paper information - A hidden Markov-model-based trainable speech synthesizer

A hidden Markov-model-based trainable speech synthesizer

This paper presents a new approach to speech synthesis in which a set of cross-word decision-tree state-clustered context-dependent hidden Markov models are used to define a set of subphone units to be used in a concatenation synthesizer. The models, trees, waveform segments and other parameters representing each clustered state are obtained completely automatically through training on a 1 hour single-speaker continuous-speech database. During synthesis the required utterance, specified as a string of words of known phonetic pronunciation, is generated as a sequence of these clustered states using a TD-PSOLA waveform concatenation synthesizer. The system produces speech which, though in a monotone, is both natural sounding and highly intelligible. A Modified Rhyme Test conducted to measure segmental intelligibility yielded a 5.0% error rate. The speech produced by the system mimics the voice of the speaker used to record the training database. The system can be retrained on a new voice in less than 48 hours, and has been successfully trained on four voices.

Philip C. Woodland | Robert E. Donovan
