Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis

This paper describes a speaker-adaptive HMM-based speech synthesis system. The new system, called ldquoHTS-2007,rdquo employs speaker adaptation (CSMAPLR+MAP), feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in our previous systems. Subjective evaluation results show that the new system generates significantly better quality synthetic speech than speaker-dependent approaches with realistic amounts of speech data, and that it bears comparison with speaker-dependent approaches even when large amounts of speech data are available. In addition, a comparison study with several speech synthesis techniques shows the new system is very robust: It is able to build voices from less-than-ideal speech data and synthesize good-quality speech even for out-of-domain sentences.

[1]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[2]  R. Moore,et al.  Explicit modelling of state occupancy in hidden Markov models for automatic speech recognition , 1985, ICASSP '85. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[3]  Stephen E. Levinson,et al.  Continuously variable duration hidden Markov models for automatic speech recognition , 1986 .

[4]  Jae S. Lim,et al.  Multiband excitation vocoder , 1988, IEEE Transactions on Acoustics, Speech, and Signal Processing.

[5]  Eric Moulines,et al.  Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones , 1989, Speech Commun..

[6]  Thomas P. Barnwell,et al.  MCCREE AND BARNWELL MIXED EXCITAmON LPC VOCODER MODEL LPC SYNTHESIS FILTER 243 SYNTHESIZED SPEECH-PERIODIC PULSE TRAIN-1 PERIODIC POSITION JITTER PULSE 4 , 2004 .

[7]  K. Tokuda,et al.  Speech parameter generation from HMM using dynamic features , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[8]  Vassilios Digalakis,et al.  Speaker adaptation using constrained estimation of Gaussian mixtures , 1995, IEEE Trans. Speech Audio Process..

[9]  Mary P. Harper,et al.  On the complexity of explicit duration HMM's , 1995, IEEE Trans. Speech Audio Process..

[10]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[11]  Alan W. Black,et al.  Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[12]  Keiichi Tokuda,et al.  Speech synthesis using HMMs with dynamic features , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[13]  Richard M. Schwartz,et al.  A compact model for speaker-adaptive training , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[14]  Vassilios Digalakis,et al.  Speaker adaptation using combined transformation and Bayesian methods , 1996, IEEE Trans. Speech Audio Process..

[15]  Martine Grice,et al.  The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using Semantically Unpredictable Sentences , 1996, Speech Commun..

[16]  Keiichi Tokuda,et al.  Duration modeling for HMM-based speech synthesis , 1998, ICSLP.

[17]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[18]  Keiichi Tokuda,et al.  Speaker adaptation for HMM-based speech synthesis system using MLLR , 1998, SSW.

[19]  Paul Taylor,et al.  Festival Speech Synthesis System , 1998 .

[20]  Ramesh A. Gopinath,et al.  Maximum likelihood modeling with Gaussian distributions for classification , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[21]  Roy D. Patterson,et al.  Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity , 1999, EUROSPEECH.

[22]  Susan Fitt,et al.  Synthesis of regional English using a keyword lexicon , 1999, EUROSPEECH.

[23]  Mark J. F. Gales,et al.  Semi-tied covariance matrices for hidden Markov models , 1999, IEEE Trans. Speech Audio Process..

[24]  Keiichi Tokuda,et al.  Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis , 1999, EUROSPEECH.

[25]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[26]  Keiichi Tokuda,et al.  Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[27]  Koichi Shinoda,et al.  MDL-based context-dependent subword modeling for speech recognition , 2000 .

[28]  Hideki Kawahara,et al.  Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT , 2001, MAVEBA.

[29]  Chin-Hui Lee,et al.  A structural Bayes approach to speaker adaptation , 2001, IEEE Trans. Speech Audio Process..

[30]  Keiichi Tokuda,et al.  Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[31]  Keiichi Tokuda,et al.  Mixed excitation for HMM-based speech synthesis , 2001, INTERSPEECH.

[32]  Keiichi Tokuda,et al.  Multi-Space Probability Distribution HMM , 2002 .

[33]  H. Zen,et al.  An HMM-based speech synthesis system applied to English , 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis, 2002..

[34]  K. Tokuda,et al.  A Training Method of Average Voice Model for HMM-Based Speech Synthesis , 2003, IEICE Trans. Fundam. Electron. Commun. Comput. Sci..

[35]  H. Kobayashi,et al.  An efficient forward-backward algorithm for an explicit-duration hidden Markov model , 2003, IEEE Signal Processing Letters.

[36]  Takao Kobayashi,et al.  Robust F0 Estimation of Speech Signal Using Harmonicity Measure Based on Instantaneous Frequency , 2004, IEICE Trans. Inf. Syst..

[37]  Keiichi Tokuda,et al.  Speaker adaptation of pitch and spectrum for HMM-based speech synthesis , 2002, Systems and Computers in Japan.

[38]  Keiichi Tokuda,et al.  Incorporating a mixed excitation model and postfilter into HMM-based text-to-speech synthesis , 2005 .

[39]  Keiichi Tokuda,et al.  Incorporating a mixed excitation model and postfilter into HMM-based text-to-speech synthesis , 2005, Systems and Computers in Japan.

[40]  Keiichi Tokuda,et al.  The blizzard challenge - 2005: evaluating corpus-based speech synthesis on common datasets , 2005, INTERSPEECH.

[41]  Takao Kobayashi,et al.  Acoustic Modeling of Speaking Styles and Emotional Expressions in HMM-Based Speech Synthesis , 2005, IEICE Trans. Inf. Syst..

[42]  David Talkin,et al.  A Robust Algorithm for Pitch Tracking ( RAPT ) , 2005 .

[43]  Zhenhua Ling HMM-based Unit Selection Using F , 2006 .

[44]  Ren-Hua Wang,et al.  Minimum Generation Error Training for HMM-Based Speech Synthesis , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[45]  Alan W. Black,et al.  The Blizzard Challenge 2006 , 2006 .

[46]  Ren-Hua Wang,et al.  HMM-Based Emotional Speech Synthesis Using Average Emotion Model , 2006, ISCSLP.

[47]  Sherif Abdou,et al.  Improving Arabic HMM based speech synthesis quality , 2006, INTERSPEECH.

[48]  Takao Kobayashi,et al.  A Style Adaptation Technique for Speech Synthesis Using HSMM and Suprasegmental Features , 2006, IEICE Trans. Inf. Syst..

[49]  Sadaoki Furui,et al.  New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer , 2006, Speech Commun..

[50]  Ren-Hua Wang,et al.  USTC System for Blizzard Challenge 2006 an Improved HMM-based Speech Synthesis Method , 2006 .

[51]  Tomoki Toda,et al.  Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation , 2006, INTERSPEECH.

[52]  Heiga Zen,et al.  The HMM-based speech synthesis system (HTS) version 2.0 , 2007, SSW.

[53]  Takao Kobayashi,et al.  Model Adaptation Approach to Speech Synthesis with Diverse Voices and Styles , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[54]  Takao Kobayashi,et al.  Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training , 2007, IEICE Trans. Inf. Syst..

[55]  Heiga Zen,et al.  State Duration Modeling for HMM-Based Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[56]  Heng Lu,et al.  The USTC and iFlytek Speech Synthesis Systems for Blizzard Challenge 2007 , 2007 .

[57]  Heiga Zen,et al.  Speaker-Independent HMM-based Speech Synthesis System: HTS-2007 System for the Blizzard Challenge 2007 , 2007 .

[58]  Heiga Zen,et al.  Details of the Nitech HMM-Based Speech Synthesis System for the Blizzard Challenge 2005 , 2007, IEICE Trans. Inf. Syst..

[59]  Simon King,et al.  The Blizzard Challenge 2007 , 2007 .

[60]  H. Zen IMPLEMENTING AN HSMM-BASED SPEECH SYNTHESIS SYSTEM USING AN EFFICIENT FORWARD-BACKWARD ALGORITHM , 2007 .

[61]  Junichi Yamagishi,et al.  Festival multisyn voices for the 2007 blizzard challenge. , 2007 .

[62]  Keiichi Tokuda,et al.  A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[63]  Ren-Hua Wang,et al.  HMM-Based Hierarchical Unit Selection Combining Kullback-Leibler Divergence with Likelihood Criterion , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[64]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[65]  Heiga Zen,et al.  A Hidden Semi-Markov Model-Based Speech Synthesis System , 2007, IEICE Trans. Inf. Syst..

[66]  Keiichi Tokuda,et al.  ATRECSS — ATR ENGLISH SPEECH CORPUS FOR SPEECH SYNTHESIS , 2007 .

[67]  Simon King,et al.  Multisyn: Open-domain unit selection for the Festival speech synthesis system , 2007, Speech Commun..

[68]  Simon King,et al.  Statistical analysis of the Blizzard Challenge 2007 listening test results , 2007 .

[69]  Junichi Yamagishi,et al.  Combining Statistical Parameteric Speech Synthesis and Unit-Selection for Automatic Voice Cloning , 2008 .

[70]  Oliver Watts,et al.  HMM-based synthesis of child speech , 2008, WOCCI.

[71]  Heiga Zen,et al.  The Nitech-NAIST HMM-Based Speech Synthesis System for the Blizzard Challenge 2006 , 2008, IEICE Trans. Inf. Syst..

[72]  Heiga Zen,et al.  The HTS-2008 System: Yet Another Evaluation of the Speaker-Adaptive HMM-based Speech Synthesis System in The 2008 Blizzard Challenge , 2008 .

[73]  Takao Kobayashi,et al.  Analysis of Speaker Adaptation Algorithms for HMM-Based Speech Synthesis and a Constrained SMAPLR Adaptation Algorithm , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[74]  Simon King,et al.  The Blizzard Challenge 2009 , 2009 .