Hidden semi-Markov model based speech synthesis

In the present paper, a hidden-semi Markov model (HSMM) based speech synthesis system is proposed. In a hidden Markov model (HMM) based speech synthesis system which we have proposed, rhythm and tempo are controlled by state duration probability distributions modeled by single Gaussian distributions. To synthesis speech, it constructs a sentence HMM corresponding to an arbitralily given text and determine state durations maximizing their probabilities, then a speech parameter vector sequence is generated for the given state sequence. However, there is an inconsistency: although the speech is synthesized from HMMs with explicit state duration probability distributions, HMMs are trained without them. In the present paper, we introduce an HSMM, which is an HMM with explicit state duration probability distributions, into the HMM-based speech synthesis system. Experimental results show that the use of HSMM training improves the naturalness of the synthesized speech.

[1]  R. Moore,et al.  Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition , 1985, ICASSP '85. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2]  Stephen E. Levinson,et al.  Continuously variable duration hidden Markov models for automatic speech recognition , 1986 .

[3]  M. D. Riley Tree-based modeling of segmental durations , 1992 .

[4]  Gérard Bailly,et al.  Talking Machines: Theories, Models, and Designs , 1992 .

[5]  K. Tokuda,et al.  Speech parameter generation from HMM using dynamic features , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[6]  Jj Odell,et al.  The Use of Context in Large Vocabulary Speech Recognition , 1995 .

[7]  Chilin Shih,et al.  Multi-lingual duration modeling , 1997, EUROSPEECH.

[8]  Keiichi Tokuda,et al.  Hidden Markov models based on multi-space probability distribution for pitch pattern modeling , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[9]  Keiichi Tokuda,et al.  Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis , 1999, EUROSPEECH.