Speech parameter generation algorithms for HMM-based speech synthesis

This paper derives a speech parameter generation algorithm for HMM-based speech synthesis, in which the speech parameter sequence is generated from HMMs whose observation vector consists of a spectral parameter vector and its dynamic feature vectors. In the algorithm, we assume that the state sequence (the state and mixture sequence in the multi-mixture case), or a part of it, is unobservable (i.e., hidden or latent). As a result, the algorithm iterates between the forward-backward algorithm and the parameter generation algorithm for the case where the state sequence is given. Experimental results show that the algorithm reproduces clearer formant structure from multi-mixture HMMs than from single-mixture HMMs.
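The inner step of the iteration, generating parameters when the state sequence is given, amounts to maximizing the output likelihood of the static-plus-dynamic observation vectors under the constraint that the dynamic features are computed from the static ones. This reduces to a weighted least-squares problem: with W the matrix mapping the static sequence c to the stacked [static; delta] sequence, solve (Wᵀ Σ⁻¹ W) c = Wᵀ Σ⁻¹ μ for the state-sequence means μ and (diagonal) covariances Σ. The following is a minimal one-dimensional NumPy sketch under these assumptions (the function name `mlpg` and the simple delta window 0.5·(c[t+1] − c[t−1]) are illustrative choices, not the paper's exact formulation):

```python
import numpy as np

def mlpg(means, variances):
    """Generate a static parameter trajectory from per-frame Gaussian
    statistics of [static, delta] features (1-dimensional sketch).

    means, variances : (T, 2) arrays, column 0 = static, column 1 = delta.
    Returns the maximum-likelihood static sequence c of shape (T,).
    """
    T = means.shape[0]
    # W maps the static sequence c (length T) to the stacked
    # [static; delta] observation vector (length 2T).
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                      # static feature: c[t]
        if t + 1 < T:                          # delta feature:
            W[2 * t + 1, t + 1] = 0.5          #   0.5 * (c[t+1] - c[t-1]),
        if t - 1 >= 0:                         #   truncated at the boundaries
            W[2 * t + 1, t - 1] = -0.5
    mu = means.reshape(-1)
    prec = 1.0 / variances.reshape(-1)         # diagonal precisions
    # Normal equations of the weighted least-squares problem:
    #   (W^T Sigma^-1 W) c = W^T Sigma^-1 mu
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)
```

In practice W is banded, so the system is solved in O(T) with a band solver rather than a dense one; the sketch uses a dense solve only for brevity. Note that when the delta variances are very large (the dynamic constraint is uninformative), the solution collapses to the per-frame static means, whereas tight delta variances smooth the trajectory across state boundaries.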
