An introduction of trajectory model into HMM-based speech synthesis

In the synthesis part of the hidden Markov model (HMM) based speech synthesis system that we have proposed, a speech parameter vector sequence is generated from a sentence HMM corresponding to an arbitrarily given text by using a speech parameter generation algorithm. However, there is an inconsistency between synthesis and training: the speech parameter vector sequence is generated under the constraints between static and dynamic features, whereas the HMM parameters are trained without any such constraints, in the same way as in standard HMM training. In the present paper, we introduce a trajectory-HMM, which is derived from the HMM by imposing the constraints between static and dynamic features, into the training part of the HMM-based speech synthesis system. Experimental results show that trajectory-HMM training improves the quality of the synthesized speech.
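
To make the constraint between static and dynamic features concrete, the following is a minimal sketch of parameter generation for a one-dimensional feature, assuming the commonly used formulation in which the observation sequence is a linear function of the static sequence, o = W c, and the static sequence c is chosen to maximize the Gaussian likelihood along a fixed state sequence. The function names, the delta window coefficients (-0.5, 0, 0.5), and the toy means and variances are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def build_window_matrix(num_frames, delta_win=(-0.5, 0.0, 0.5)):
    """Build W of shape (2T, T) so that o = W @ c interleaves static and delta rows.

    The delta window is an assumed, illustrative choice; it is truncated at
    the sequence boundaries.
    """
    W = np.zeros((2 * num_frames, num_frames))
    for t in range(num_frames):
        W[2 * t, t] = 1.0  # static feature at frame t
        for k, coeff in enumerate(delta_win):
            tau = t + k - 1  # neighbouring frame touched by the delta window
            if 0 <= tau < num_frames:
                W[2 * t + 1, tau] = coeff
    return W


def generate_trajectory(means, variances, W):
    """Solve (W^T P W) c = W^T P mu, with P the diagonal precision matrix."""
    precision = 1.0 / variances
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * means)
    return np.linalg.solve(A, b)


# Toy example (hypothetical values): a step in the static means, zero delta means.
T = 5
W = build_window_matrix(T)
static_mean = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
means = np.empty(2 * T)
means[0::2] = static_mean  # static means per frame
means[1::2] = 0.0          # delta means per frame
variances = np.ones(2 * T)

c = generate_trajectory(means, variances, W)
print(c)
```

Because the delta rows of W couple neighbouring frames, the generated sequence c differs from the piecewise-constant static means. Standard HMM training ignores this coupling and treats static and delta features as independent observations, which is the inconsistency that the abstract refers to.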
