A Hidden Semi-Markov Model-Based Speech Synthesis System

A statistical speech synthesis system based on the hidden Markov model (HMM) was recently proposed. In this system, spectrum, excitation, and duration of speech are modeled simultaneously by context-dependent HMMs, and speech parameter vector sequences are generated from the HMMs themselves. This system defines a speech synthesis problem in a generative model framework and solves it based on the maximum likelihood (ML) criterion. However, there is an inconsistency: although state duration probability density functions (PDFs) are explicitly used in the synthesis part of the system, they have not been incorporated into its training part. This inconsistency can make the synthesized speech sound less natural. In this paper, we propose a statistical speech synthesis system based on a hidden semi-Markov model (HSMM), which can be viewed as an HMM with explicit state duration PDFs. The use of HSMMs can solve the above inconsistency because we can incorporate the state duration PDFs explicitly into both the synthesis and the training parts of the system. Subjective listening test results show that use of HSMMs improves the reported naturalness of synthesized speech.

[1]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[2]  Gérard Bailly,et al.  Talking Machines: Theories, Models, and Designs , 1992 .

[3]  Chilin Shih,et al.  Multi-lingual duration modeling , 1997, EUROSPEECH.

[4]  Keiichi Tokuda,et al.  Pitch pattern generation using multispace probability distribution HMM , 2002, Systems and Computers in Japan.

[5]  F. Park ROBUST UNIT SELECTION SYSTEM FOR SPEECH SYNTHESIS , 1999 .

[6]  Alan W. Black,et al.  CHATR: a generic speech synthesis system , 1994, COLING.

[7]  Keiichi Tokuda,et al.  Investigation of State Duration Model based on Gamma distribution for HMM-based Speech Synthesis , 2001 .

[8]  Keiichi Tokuda,et al.  Eigenvoices for HMM-based speech synthesis , 2002, INTERSPEECH.

[9]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[10]  Keiichi Tokuda,et al.  Voice characteristics conversion for HMM-based speech synthesis system , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[11]  David Yarowsky,et al.  A corpus-based synthesizer , 1992, ICSLP.

[12]  Hideki Kawahara,et al.  Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT , 2001, MAVEBA.

[13]  Jeff A. Bilmes,et al.  Robust splicing costs and efficient search with BMM Models for concatenative speech synthesis , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[14]  Keiichi Tokuda,et al.  Duration modeling for HMM-based speech synthesis , 1998, ICSLP.

[15]  Keiichi Tokuda,et al.  Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[16]  Chilin Shih,et al.  Duration Study for the Bell Laboratories Mandarin Text-to-Speech System , 1997 .

[17]  Koichi Shinoda,et al.  Acoustic modeling based on the MDL principle for speech recognition , 1997, EUROSPEECH.

[18]  Stephen E. Levinson,et al.  Continuously variable duration hidden Markov models for automatic speech recognition , 1986 .

[19]  Alex Acero,et al.  Whistler: a trainable text-to-speech system , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[20]  B. Juang,et al.  Context-dependent Phonetic Hidden Markov Models for Speaker-independent Continuous Speech Recognition , 2008 .

[21]  Keiichi Tokuda,et al.  Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis , 1999, EUROSPEECH.

[22]  Kai-Fu Lee,et al.  Context-independent phonetic hidden Markov models for speaker-independent continuous speech recognition , 1990 .

[23]  Mari Ostendorf,et al.  HMM topology design using maximum likelihood successive state splitting , 1997, Comput. Speech Lang..

[24]  Shigeru Katagiri,et al.  ATR Japanese speech database as a tool of speech recognition and synthesis , 1990, Speech Commun..

[25]  Keiichi Tokuda,et al.  A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[26]  Keiichi Tokuda,et al.  Speaker interpolation in HMM-based speech synthesis system , 1997, EUROSPEECH.

[27]  Keiichi Tokuda,et al.  Multi-Space Probability Distribution HMM , 2002 .

[28]  Mari Ostendorf,et al.  From HMM's to segment models: a unified view of stochastic modeling for speech recognition , 1996, IEEE Trans. Speech Audio Process..

[29]  Alan W. Black,et al.  Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[30]  Keiichi Tokuda,et al.  Speech synthesis using HMMs with dynamic features , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[31]  Heiga Zen,et al.  An overview of nitech HMM-based speech synthesis system for blizzard challenge 2005 , 2005, INTERSPEECH.

[32]  Keiichi Tokuda,et al.  Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[33]  Shinsuke Sakai,et al.  A probabilistic approach to unit selection for corpus-based speech synthesis , 2005, INTERSPEECH.

[34]  Jj Odell,et al.  The Use of Context in Large Vocabulary Speech Recognition , 1995 .

[35]  Mary P. Harper,et al.  On the complexity of explicit duration HMM's , 1995, IEEE Trans. Speech Audio Process..

[36]  Steve Young,et al.  Benchmark DARPA RM results using the HTK portable HMM toolkit , 1992 .

[37]  Mei-Yuh Hwang,et al.  Predicting unseen triphones with senones , 1996, IEEE Trans. Speech Audio Process..

[38]  Heiga Zen,et al.  Hidden semi-Markov model based speech synthesis , 2004, INTERSPEECH.

[39]  Cyril Allauzen,et al.  Statistical Modeling for Unit Selection in Speech Synthesis , 2004, ACL.

[40]  Robert E. Donovan,et al.  The IBM trainable speech synthesis system , 1998, ICSLP.

[41]  Yoshinori Sagisaka,et al.  Statistical modelling of speech segment duration by constrained tree regression , 2000 .

[42]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.