Recent development of the HMM-based speech synthesis system (HTS)

A statistical parametric approach to speech synthesis based on hidden Markov models (HMMs) has grown in popularity over the last few years. In this approach, the spectrum, excitation, and duration of speech are modeled simultaneously by context-dependent HMMs, and speech waveforms are generated from the HMMs themselves. Since December 2002, we have publicly released an open-source software toolkit named the "HMM-based speech synthesis system (HTS)" to provide a research and development platform for statistical parametric speech synthesis. This paper describes recent developments of HTS in detail and outlines future release plans.
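The generation step described above can be illustrated with a toy sketch of the maximum-likelihood speech parameter generation algorithm (Tokuda et al., 2000) that underlies this approach: given per-frame means and variances of static and delta features predicted by the HMMs, the smooth static trajectory c is obtained by solving W'U⁻¹W c = W'U⁻¹μ. All numbers below are invented for a 1-D, diagonal-covariance example, not real HTS model output.

```python
import numpy as np

T = 5  # number of frames (toy example)

# Per-frame means and inverse variances (precisions) for a [static, delta]
# observation of a single 1-D parameter stream, as an HMM might predict them.
mu = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 0.5], [2.5, 0.0], [2.5, 0.0]])
prec = np.ones((T, 2))  # diagonal U^-1 (toy: unit precisions)

# W maps the static trajectory c (length T) to the stacked [static; delta]
# observation sequence, with delta_t = 0.5 * (c_{t+1} - c_{t-1}).
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0                   # static row: o_t = c_t
    W[2 * t + 1, min(t + 1, T - 1)] += 0.5  # delta row (clipped at edges)
    W[2 * t + 1, max(t - 1, 0)] -= 0.5

# Solve the normal equations W' U^-1 W c = W' U^-1 mu for the trajectory.
Uinv = np.diag(prec.ravel())
A = W.T @ Uinv @ W
b = W.T @ Uinv @ mu.ravel()
c = np.linalg.solve(A, b)
```

The delta constraints couple adjacent frames, so the solved trajectory rises smoothly toward the final target rather than jumping between per-state means; this smoothing is what distinguishes generation from simply emitting the HMM mean sequence.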
