Speech Synthesis Based on Hidden Markov Models

This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech. The main advantage of this approach is its flexibility in changing speaker identities, emotions, and speaking styles. This paper also discusses the relation between the HMM-based approach and the more conventional unit-selection approach that has dominated over the last decades. Finally, advanced techniques for future developments are described.

[1]  Keiichi Tokuda,et al.  Mel-generalized cepstral analysis - a unified approach to speech spectral estimation , 1994, ICSLP.

[2]  Soufiane Rouibia,et al.  Unit selection for speech synthesis based on a new acoustic target cost , 2005, INTERSPEECH.

[3]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[4]  Paavo Alku,et al.  The GlottHMM Speech Synthesis Entry for Blizzard Challenge 2010 , 2010 .

[5]  D. Pisoni,et al.  Speech Synthesis, Perception and Comprehension of , 2006 .

[6]  Toshio Hirai,et al.  Using 5 ms segments in concatenative speech synthesis , 2004, SSW.

[7]  Heiga Zen,et al.  Acoustic modeling with contextual additive structure for HMM-based speech recognition , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[8]  Dennis H. Klatt,et al.  Software for a cascade/parallel formant synthesizer , 1980 .

[9]  Alan W. Black Unit selection and emotional speech , 2003, INTERSPEECH.

[10]  Heiga Zen,et al.  Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters , 2010, SSW.

[11]  Roland Kuhn,et al.  Rapid speaker adaptation in eigenvoice space , 2000, IEEE Trans. Speech Audio Process..

[12]  K. Tokuda,et al.  Speech parameter generation from HMM using dynamic features , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[13]  Yoshinori Sagisaka,et al.  Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks , 1995, Speech Commun..

[14]  Heiga Zen,et al.  The HMM-based speech synthesis system (HTS) version 2.0 , 2007, SSW.

[15]  YoungSteve,et al.  The application of hidden Markov models in speech recognition , 2007 .

[16]  Junichi Yamagishi,et al.  Combining vocal tract length normalization with hierarchial linear transformations , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  S. King,et al.  Speech synthesis technologies for individuals with vocal disabilities: Voice banking and reconstruction , 2012 .

[18]  Keikichi Hirose,et al.  Analysis of voice fundamental frequency contours for declarative sentences of Japanese , 1984 .

[19]  Junichi Yamagishi,et al.  Modeling and interpolation of Austrian German and Viennese dialect in HMM-based speech synthesis , 2010, Speech Commun..

[20]  Tetsunori Kobayashi,et al.  Hybrid Voice Conversion of Unit Selection and Generation Using Prosody Dependent HMM , 2006, IEICE Trans. Inf. Syst..

[21]  Simon King,et al.  The Blizzard Challenge 2008 , 2008 .

[22]  Hirokazu Kameoka,et al.  A statistical model of speech F0 contours , 2010, SAPA@INTERSPEECH.

[23]  Keiichi Tokuda,et al.  Duration modeling for HMM-based speech synthesis , 1998, ICSLP.

[24]  Heiga Zen,et al.  The HTS-2008 System: Yet Another Evaluation of the Speaker-Adaptive HMM-based Speech Synthesis System in The 2008 Blizzard Challenge , 2008 .

[25]  Richard M. Schwartz,et al.  A compact model for speaker-adaptive training , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[26]  F. Itakura,et al.  Minimum prediction residual principle applied to speech recognition , 1975 .

[27]  Keiichi Tokuda,et al.  XIMERA: a new TTS from ATR based on corpus-based technologies , 2004, SSW.

[28]  Justin Fackrell,et al.  Segment selection in the L&h Realspeak laboratory TTS system , 2000, INTERSPEECH.

[29]  Robert E. Donovan,et al.  The IBM trainable speech synthesis system , 1998, ICSLP.

[30]  Yoshihiko Nankaku,et al.  Factor analyzed voice models for HMM-based speech synthesis , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[31]  Keiichi Tokuda,et al.  Analysis of unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using KLD-based transform mapping , 2012, Speech Commun..

[32]  Mark J. F. Gales,et al.  The Application of Hidden Markov Models in Speech Recognition , 2007, Found. Trends Signal Process..

[33]  Jerome R. Bellegarda,et al.  A Data-Driven Affective Analysis Framework Toward Naturally Expressive Speech Synthesis , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[34]  Sadaoki Furui,et al.  Polyglot synthesis using a mixture of monolingual corpora , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[35]  Yuan Jiang,et al.  Multi-tier Non-uniform Unit Selection for Corpus-based Speech Synthesis , 2006 .

[36]  Enrico Zovato,et al.  Speech synthesis enhancement in noisy environments , 2007, INTERSPEECH.

[37]  Keiichi Tokuda,et al.  An adaptive algorithm for mel-cepstral analysis of speech , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[38]  Alan W. Black,et al.  Data-driven phrasing for speech synthesis in low-resource languages , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39]  Heiga Zen,et al.  Product of Experts for Statistical Parametric Speech Synthesis , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[40]  Alex Acero,et al.  Automatic generation of synthesis units for trainable text-to-speech systems , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[41]  Heiga Zen,et al.  A Hidden Semi-Markov Model-Based Speech Synthesis System , 2007, IEICE Trans. Inf. Syst..

[42]  Heiga Zen,et al.  A Bayesian approach to HMM-based speech synthesis , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[43]  Keiichi Tokuda,et al.  A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[44]  Philip C. Woodland,et al.  Improvements in an HMM-based speech synthesiser , 1995, EUROSPEECH.

[45]  Takashi Nose,et al.  Speaker and style adaptation using average voice model for style control in HMM-based speech synthesis , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[46]  Keiichi Tokuda,et al.  Speaker interpolation in HMM-based speech synthesis system , 1997, EUROSPEECH.

[47]  Thierry Dutoit,et al.  Continuous Control of the Degree of Articulation in HMM-Based Speech Synthesis , 2011, INTERSPEECH.

[48]  Naonori Ueda,et al.  Variational bayesian estimation and clustering for speech recognition , 2004, IEEE Transactions on Speech and Audio Processing.

[49]  Keiichi Tokuda,et al.  Multi-Space Probability Distribution HMM , 2002 .

[50]  Koichi Shinoda,et al.  MDL-based context-dependent subword modeling for speech recognition , 2000 .

[51]  Shigeki Sagayama,et al.  Multiple-regression hidden Markov model , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[52]  Ren-Hua Wang,et al.  HMM-Based Hierarchical Unit Selection Combining Kullback-Leibler Divergence with Likelihood Criterion , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[53]  Yoshihiko Nankaku,et al.  Simultaneous Acoustic, Prosodic, and Phrasing Model Training for TTs Conversion Systems , 2008, 2008 6th International Symposium on Chinese Spoken Language Processing.

[54]  Heiga Zen,et al.  The Effect of Using Normalized Models in Statistical Speech Synthesis , 2011, INTERSPEECH.

[55]  Alex Acero,et al.  Whistler: a trainable text-to-speech system , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[56]  Heiga Zen,et al.  An HMM-based singing voice synthesis system , 2006, INTERSPEECH.

[57]  Takao Kobayashi,et al.  Speech Synthesis with Various Emotional Expressions and Speaking Styles by Style Interpolation and Morphing , 2005, IEICE Trans. Inf. Syst..

[58]  Michael Picheny,et al.  A corpus-based approach to expressive speech synthesis , 2004, SSW.

[59]  Paavo Alku,et al.  HMM-based Finnish text-to-speech system utilizing glottal inverse filtering , 2008, INTERSPEECH.

[60]  Sherif Abdou,et al.  Improving Arabic HMM based speech synthesis quality , 2006, INTERSPEECH.

[61]  H. Zen,et al.  An HMM-based speech synthesis system applied to English , 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis, 2002..

[62]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[63]  Alan W. Black,et al.  CLUSTERGEN: a statistical parametric synthesizer using trajectory modeling , 2006, INTERSPEECH.

[64]  Takashi Nose,et al.  A Style Control Technique for HMM-Based Expressive Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[65]  Mari Ostendorf,et al.  A dynamical system model for generating F0 for synthesis , 1994, SSW.

[66]  R. Bakis,et al.  A CORPUS-BASED APPROACH TO < AHEM / > EXPRESSIVE SPEECH SYNTHESIS , 2004 .

[67]  Paul Taylor Unifying unit selection and hidden Markov model speech synthesis , 2006, INTERSPEECH.

[68]  Xia Wang,et al.  A Novel HMM-Based TTS System using Both Continuous HMMS and Discrete HMMS , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[69]  Paul Dalsgaard,et al.  Modelling intonation contours at the phrase level using continuous density hidden Markov models , 1994, Comput. Speech Lang..

[70]  Keiichi Tokuda,et al.  Mixed excitation for HMM-based speech synthesis , 2001, INTERSPEECH.

[71]  Keiichi Tokuda,et al.  Eigenvoices for HMM-based speech synthesis , 2002, INTERSPEECH.

[72]  K. Tokuda,et al.  A Training Method of Average Voice Model for HMM-Based Speech Synthesis , 2003, IEICE Trans. Fundam. Electron. Commun. Comput. Sci..

[73]  Keiichi Tokuda,et al.  Voice characteristics conversion for HMM-based speech synthesis system , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[74]  Alan W. Black,et al.  Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[75]  Keiichi Tokuda,et al.  Decision-tree backing-off in HMM-based speech synthesis , 2004, INTERSPEECH.

[76]  Matthew J. Beal Variational algorithms for approximate Bayesian inference , 2003 .

[77]  Frank Fallside,et al.  Lexical stress recognition using hidden Markov models , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[78]  Marc Schröder,et al.  Emotional speech synthesis: a review , 2001, INTERSPEECH.

[79]  Keiichi Tokuda,et al.  Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[80]  Heiga Zen,et al.  Details of the Nitech HMM-Based Speech Synthesis System for the Blizzard Challenge 2005 , 2007, IEICE Trans. Inf. Syst..

[81]  Thierry Dutoit,et al.  Using a pitch-synchronous residual codebook for hybrid HMM/frame selection speech synthesis , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[82]  Heiga Zen,et al.  An excitation model for HMM-based speech synthesis based on residual modeling , 2007, SSW.

[83]  Keiichi Tokuda,et al.  Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[84]  Paavo Alku,et al.  HMM-Based Speech Synthesis Utilizing Glottal Inverse Filtering , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[85]  P. Schönle,et al.  Electromagnetic articulography: Use of alternating magnetic fields for tracking movements of multiple points inside and outside the vocal tract , 1987, Brain and Language.

[86]  Eric Moulines,et al.  Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones , 1989, Speech Commun..

[87]  A. Bonafonte,et al.  FLEXIBLE HARMONIC / STOCHASTIC MODELING FOR HMM-BASED SPEECH SYNTHESIS , 2008 .

[88]  Keiichi Tokuda,et al.  The blizzard challenge - 2005: evaluating corpus-based speech synthesis on common datasets , 2005, INTERSPEECH.

[89]  Alan W. Black,et al.  The Blizzard Challenge 2006 , 2006 .

[90]  F. Itakura Line spectrum representation of linear predictor coefficients of speech signals , 1975 .

[91]  Jj Odell,et al.  The Use of Context in Large Vocabulary Speech Recognition , 1995 .

[92]  Roger K. Moore,et al.  Reactive Speech Synthesis: Actively Managing Phonetic Contrast along an H&H Continuum , 2011, ICPhS.

[93]  Keiichi Tokuda,et al.  Vector Quantization of Speech Spectral Parameters Using Statistics of Static and Dynamic Features , 2001 .

[94]  Keiichi Tokuda,et al.  Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis , 1999, EUROSPEECH.

[95]  Takashi Nose,et al.  HMM-Based Style Control for Expressive Speech Synthesis with Arbitrary Speaker's Voice Using Model Adaptation , 2009, IEICE Trans. Inf. Syst..

[96]  Simon King,et al.  The Blizzard Challenge 2007 , 2007 .

[97]  Gunnar Fant,et al.  Acoustic Theory Of Speech Production , 1960 .

[98]  Simon King,et al.  The Blizzard Challenge 2009 , 2009 .

[99]  Takao Kobayashi,et al.  A style control technique for HMM-based speech synthesis , 2004, INTERSPEECH.

[100]  Minsoo Hahn,et al.  Two-Band Excitation for HMM-Based Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[101]  Kai Yu,et al.  Word-level emphasis modelling in HMM-based speech synthesis , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[102]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[103]  Zhen-Hua Ling,et al.  Articulatory Control of HMM-Based Parametric Speech Synthesis Using Feature-Space-Switched Multiple Regression , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[104]  Marc C. Beutnagel,et al.  The AT & T NEXT-GEN TTS system , 1999 .

[105]  Joseph P. Olive,et al.  Text-to-speech synthesis , 1995, AT&T Technical Journal.

[106]  Keiichi Tokuda,et al.  Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[107]  Toshio Hirai,et al.  An MRI‐based time‐domain speech synthesis system , 2006 .

[108]  Joan Claudi Socoró,et al.  Linguistic and mixed excitation improvements on a HMM-based speech synthesis for Castilian Spanish , 2007, SSW.

[109]  Paavo Alku,et al.  Analysis of HMM-Based Lombard Speech Synthesis , 2011, INTERSPEECH.

[110]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[111]  Takashi Nose,et al.  A Speaker Adaptation Technique for MRHSMM-Based Style Control of Synthetic Speech , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[112]  Philip C. Woodland Speaker adaptation for continuous density HMMs: a review , 2001 .

[113]  Ren-Hua Wang,et al.  USTC System for Blizzard Challenge 2006 an Improved HMM-based Speech Synthesis Method , 2006 .

[114]  Heiga Zen,et al.  A Covariance-Tying Technique for HMM-Based Speech Synthesis , 2010, IEICE Trans. Inf. Syst..

[115]  Heng Lu,et al.  The USTC and iFlytek Speech Synthesis Systems for Blizzard Challenge 2007 , 2007 .

[116]  Junichi Yamagishi,et al.  Towards an improved modeling of the glottal source in statistical parametric speech synthesis , 2007, SSW.

[117]  Peter Jackson,et al.  A phonologically motivated method of selecting non-uniform units , 1998, ICSLP.

[118]  Ren-Hua Wang,et al.  Integrating Articulatory Features Into HMM-Based Parametric Speech Synthesis , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[119]  Takao Kobayashi,et al.  Analysis of Speaker Adaptation Algorithms for HMM-Based Speech Synthesis and a Constrained SMAPLR Adaptation Algorithm , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[120]  Ren-Hua Wang,et al.  Minimum Generation Error Training for HMM-Based Speech Synthesis , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[121]  Jong-Jin Kim,et al.  HMM-based Korean speech synthesis system for hand-held devices , 2006, IEEE Transactions on Consumer Electronics.

[122]  Frank K. Soong,et al.  A Cross-Language State Sharing and Mapping Approach to Bilingual (Mandarin–English) TTS , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[123]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[124]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[125]  서종수,et al.  四季 引 festival , 2009 .

[126]  Wei-Yin Loh,et al.  Classification and regression trees , 2011, WIREs Data Mining Knowl. Discov..

[127]  Heiga Zen,et al.  Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences , 2007, Comput. Speech Lang..

[128]  Junichi Yamagishi,et al.  Glottal spectral separation for parametric speech synthesis , 2008, INTERSPEECH.

[129]  S. King,et al.  The Blizzard Challenge 2010 , 2010 .

[130]  Keiichi Tokuda,et al.  Minimum generation error training by using original spectrum as reference for log spectral distortion measure , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[131]  Simon King,et al.  Thousands of Voices for HMM-Based Speech Synthesis–Analysis and Application of TTS Systems Built on Various ASR Corpora , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[132]  Coralie Hemptinne Master Thesis: Integration of the Harmonic plus Noise Model (HNM) into the Hidden Markov Model-Based Speech Synthesis System (HTS) , 2006 .

[133]  Junichi Yamagishi,et al.  Identification of contrast and its emphatic realization in HMM based speech synthesis , 2009, INTERSPEECH.