Autoregressive Models for Statistical

We propose using the autoregressive hidden Markov model (HMM) for speech synthesis. The autoregressive HMM uses the same model for parameter estimation and synthesis in a consistent way, in contrast to the standard approach to statistical parametric speech synthesis. It supports easy and efficient parameter estimation using expectation maximization, in contrast to the trajectory HMM. At the same time, its similarities to the standard approach allow use of established high-quality synthesis algorithms such as speech parameter generation considering global variance. The autoregressive HMM also supports a speech parameter generation algorithm that is not available for the standard approach or the trajectory HMM and that has particular advantages in the domain of real-time, low-latency synthesis. We show how to do efficient parameter estimation and synthesis with the autoregressive HMM and look at some of the similarities and differences between the standard approach, the trajectory HMM and the autoregressive HMM. We compare the three approaches in subjective and objective evaluations. We also systematically investigate which choices of parameters, such as autoregressive order and number of states, are optimal for the autoregressive HMM.
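The abstract does not spell out the generation algorithm, so the following is only a rough illustration of why an autoregressive output model lends itself to low-latency synthesis. It assumes a scalar linear-Gaussian output distribution in which the mean of each frame is a state-dependent bias plus a weighted sum of the previous K frames; the function name, parameter layout and toy values are hypothetical and are not taken from the paper. Under that assumption, the mean trajectory can be produced one frame at a time, using only frames that have already been generated.

```python
from typing import Dict, List, Sequence


def ar_mean_trajectory(
    state_seq: Sequence[int],
    ar_coeffs: Dict[int, List[float]],
    biases: Dict[int, float],
    order: int,
    init: float = 0.0,
) -> List[float]:
    """Generate a mean trajectory frame by frame.

    Assumed (illustrative) model for state q at time t:
        mean_t = biases[q] + sum_k ar_coeffs[q][k] * x_{t-1-k},  k = 0..order-1
    Frames before the start of the utterance are set to `init`.
    """
    history = [init] * order  # most recent frame first
    trajectory = []
    for q in state_seq:
        # Each frame depends only on the current state and past outputs,
        # so it can be emitted without waiting for future frames.
        mean = biases[q] + sum(a * x for a, x in zip(ar_coeffs[q], history))
        trajectory.append(mean)
        history = ([mean] + history)[:order]
    return trajectory


# Toy example: two states, autoregressive order 1 (all values hypothetical).
traj = ar_mean_trajectory(
    state_seq=[0, 0, 0, 1, 1],
    ar_coeffs={0: [0.5], 1: [0.5]},
    biases={0: 1.0, 1: 0.0},
    order=1,
)
print(traj)
```

In this toy run the trajectory rises smoothly toward the first state's steady-state value and then decays after the switch to the second state, without any explicit smoothing over the whole utterance. How the paper handles vector-valued features, variances and the choice of state sequence is not covered by this sketch.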
