The HTS-2008 System: Yet Another Evaluation of the Speaker-Adaptive HMM-based Speech Synthesis System in The 2008 Blizzard Challenge

For the 2008 Blizzard Challenge, we used the same speaker-adaptive approach to HMM-based speech synthesis that was used in the HTS entry to the 2007 challenge, but built an improved system in which the multi-accented English average voice model was trained on 41 hours of speech data with high-order mel-cepstral analysis, using an efficient forward-backward algorithm for the HSMM. The listener evaluation scores for the synthetic speech generated from this system were much better than in 2007: the system had the equal best naturalness on the small English data set, the equal best intelligibility on both the small and large English data sets, and the equal best naturalness on the Mandarin data. In fact, the English system was found to be as intelligible as human speech.

Index Terms: speech synthesis, HMM, HTS, speaker adaptation
