The HMM-based speech synthesis system (HTS) version 2.0

A statistical parametric speech synthesis system based on hidden Markov models (HMMs) has grown in popularity over the last few years. This system simultaneously models spectrum, excitation, and duration of speech using context-dependent HMMs and generates speech waveforms from the HMMs themselves. Since December 2002, we have publicly released an open-source software toolkit named HMM-based speech synthesis system (HTS) to provide a research and development platform for the speech synthesis community. In December 2006, HTS version 2.0 was released. This version includes a number of new features which are useful for both speech synthesis researchers and developers. This paper describes HTS version 2.0 in detail, as well as future release plans.

[1]  Satoshi Imai,et al.  Cepstral analysis synthesis on the mel frequency scale , 1983, ICASSP.

[2]  Keiichi Tokuda,et al.  An adaptive algorithm for mel-cepstral analysis of speech , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[3]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[4]  Alan W. Black,et al.  CHATR: a generic speech synthesis system , 1994, COLING.

[5]  Philip C. Woodland,et al.  Automatic speech synthesiser parameter estimation using HMMs , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[6]  Alan W. Black,et al.  Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[7]  Mark J. F. Gales,et al.  The generation and use of regression class trees for MLLR adaptation , 1996 .

[8]  Keiichi Tokuda,et al.  Speaker interpolation in HMM-based speech synthesis system , 1997, EUROSPEECH.

[9]  Keiichi Tokuda,et al.  Voice characteristics conversion for HMM-based speech synthesis system , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[10]  Keiichi Tokuda,et al.  Duration modeling for HMM-based speech synthesis , 1998, ICSLP.

[11]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[12]  Paul Taylor,et al.  Festival Speech Synthesis System , 1998 .

[13]  Takao Kobayashi,et al.  Text-to-audio-visual speech synthesis based on parameter generation from HMM , 1999, EUROSPEECH.

[14]  Mark J. F. Gales,et al.  Semi-tied covariance matrices for hidden Markov models , 1999, IEEE Trans. Speech Audio Process..

[15]  Keiichi Tokuda,et al.  Hidden Markov models based on multi-space probability distribution for pitch pattern modeling , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[16]  Keiichi Tokuda,et al.  Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis , 1999, EUROSPEECH.

[17]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[18]  Mark J. F. Gales,et al.  Maximum likelihood multiple projection schemes for hidden Markov models , 1999 .

[19]  Keiichi Tokuda,et al.  HMM-based text-to-audio-visual speech synthesis , 2000, INTERSPEECH.

[20]  Keiichi Tokuda,et al.  Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[21]  Marc Schröder,et al.  The German Text-to-Speech Synthesis System MARY: A Tool for Research, Development and Teaching , 2003, Int. J. Speech Technol..

[22]  Keiichi Tokuda,et al.  Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[23]  Masatsune Tamura,et al.  A Context Clustering Technique for Average Voice Models , 2003 .

[24]  Keiichi Tokuda,et al.  Eigenvoices for HMM-based speech synthesis , 2002, INTERSPEECH.

[25]  Mari Ostendorf,et al.  The impact of speech recognition on speech synthesis , 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis, 2002..

[26]  Alan W. Black Unit selection and emotional speech , 2003, INTERSPEECH.

[27]  K. Tanaka,et al.  An acoustic model adaptation using HMM-based speech synthesis , 2003, International Conference on Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003.

[28]  Heiga Zen,et al.  Improving the performance of HMM-based very low bit rate speech coding , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[29]  Keikichi Hirose,et al.  Galatea : An Anthropomorphic Spoken Dialogue Agent Toolkit , 2003 .

[30]  Michael Picheny,et al.  A corpus-based approach to expressive speech synthesis , 2004, SSW.

[31]  R. Bakis,et al.  A CORPUS-BASED APPROACH TO < AHEM / > EXPRESSIVE SPEECH SYNTHESIS , 2004 .

[32]  Heiga Zen,et al.  Reformulating the HMM as a Trajectory Model , 2004 .

[33]  Takao Kobayashi,et al.  Speaking style adaptation using context clustering decision tree for HMM-based speech synthesis , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[34]  France Mihelic,et al.  Evaluation of the Slovenian HMM-Based Speech Synthesis System , 2004, TSD.

[35]  Simon King,et al.  8th ISCA Workshop on Speech Synthesis , 2004 .

[36]  Takao Kobayashi,et al.  Human Walking Motion Synthesis with Desired Pace and Stride Length Based on HSMM , 2005, IEICE Trans. Inf. Syst..

[37]  Keiichi Tokuda,et al.  Incorporating a mixed excitation model and postfilter into HMM-based text-to-speech synthesis , 2005 .

[38]  Sadaoki Furui,et al.  Polyglot synthesis using a mixture of monolingual corpora , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[39]  Keiichi Tokuda,et al.  A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[40]  Keiichi Tokuda,et al.  HMM-based european Portuguese TTS system , 2005, INTERSPEECH.

[41]  Takao Kobayashi,et al.  Model adaptation and adaptive training using ESAT algorithm for HMM-based speech synthesis , 2005, INTERSPEECH.

[42]  Takao Kobayashi,et al.  Speech Synthesis with Various Emotional Expressions and Speaking Styles by Style Interpolation and Morphing , 2005, IEICE Trans. Inf. Syst..

[43]  Takashi Nose,et al.  A style control technique for speech synthesis using multiple regression HSMM , 2006, Interspeech.

[44]  Junichi Yamagishi,et al.  Average-Voice-Based Speech Synthesis , 2006 .

[45]  Teknillinen Korkeakoulu,et al.  Auditory quality evaluation of present Finnish text-to-speech systems , 2006 .

[46]  Ivo Ipsic,et al.  Croatian HMM based speech synthesis , 2006, 28th International Conference on Information Technology Interfaces, 2006..

[47]  Wu Yi-jian HMM-based Trainable Speech Synthesis for Chinese , 2006 .

[48]  Gérard Bailly,et al.  A new trainable trajectory formation system for facial animation , 2006, ExLing.

[49]  Frank K. Soong,et al.  A multi-space distribution (MSD) approach to speech recognition of tonal languages , 2006, INTERSPEECH.

[50]  Jong-Jin Kim,et al.  Implementation and Evaluation of an HMM-Based Korean Speech Synthesis System , 2006, IEICE Trans. Inf. Syst..

[51]  Korin Richmond,et al.  A trajectory mixture density network for the acoustic-articulatory inversion mapping , 2006, INTERSPEECH.

[52]  Sherif Abdou,et al.  Improving Arabic HMM based speech synthesis quality , 2006, INTERSPEECH.

[53]  Takao Kobayashi,et al.  A Style Adaptation Technique for Speech Synthesis Using HSMM and Suprasegmental Features , 2006, IEICE Trans. Inf. Syst..

[54]  Ren-Hua Wang,et al.  USTC System for Blizzard Challenge 2006 an Improved HMM-based Speech Synthesis Method , 2006, Blizzard Challenge.

[55]  Frank K. Soong,et al.  An HMM-Based Mandarin Chinese Text-To-Speech System , 2006, ISCSLP.

[56]  Takao Kobayashi,et al.  Constrained structural maximum a posteriori linear regression for average-voice-based speech synthesis , 2006, INTERSPEECH.

[57]  Alexander I. Rudnicky,et al.  A constrained baum-welch algorithm for improved phoneme segmentation and efficient training , 2006, INTERSPEECH.

[58]  Takao Kobayashi,et al.  HSMM-Based Model Adaptation Algorithms for Average-Voice-Based Speech Synthesis , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[59]  Takao Kobayashi,et al.  Implementation and evaluation of an HMM-based Thai speech synthesis system , 2007, INTERSPEECH.

[60]  Takao Kobayashi,et al.  Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training , 2007, IEICE Trans. Inf. Syst..

[61]  Heiga Zen,et al.  State Duration Modeling for HMM-Based Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[62]  Sacha Krstulovic,et al.  An HMM-based speech synthesis system applied to German and its adaptation to a limited set of expressive football announcements , 2007, INTERSPEECH.

[63]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[64]  Heiga Zen,et al.  Details of the Nitech HMM-Based Speech Synthesis System for the Blizzard Challenge 2005 , 2007, IEICE Trans. Inf. Syst..

[65]  Junichi Yamagishi,et al.  Speech driven head motion synthesis based on a trajectory model , 2007, SIGGRAPH '07.

[66]  智基 戸田,et al.  Recent developments of the HMM-based speech synthesis system (HTS) , 2007 .

[67]  Heiga Zen,et al.  Hidden Semi-Markov Model Based Speech Synthesis System , 2006 .

[68]  Frank K. Soong,et al.  A MSD-HMM Approach to Pen Trajectory Modeling for Online Handwriting Recognition , 2007, Ninth International Conference on Document Analysis and Recognition (ICDAR 2007).

[69]  Heiga Zen,et al.  Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences , 2007, Comput. Speech Lang..

[70]  Xia Wang,et al.  A Novel HMM-Based TTS System using Both Continuous HMMS and Discrete HMMS , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[71]  Heiga Zen,et al.  The Nitech-NAIST HMM-Based Speech Synthesis System for the Blizzard Challenge 2006 , 2006, IEICE Trans. Inf. Syst..

[72]  Heiga Zen,et al.  A Bayesian approach to HMM-based speech synthesis , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.