Product of Experts for Statistical Parametric Speech Synthesis

Multiple acoustic models are often combined in statistical parametric speech synthesis. Both linear and non-linear functions of an observation sequence are used as features to be modeled. This paper shows that this combination of multiple acoustic models can be expressed as a product of experts (PoE); the likelihoods from the models are scaled, multiplied together, and then normalized. Normally these models are individually trained and only combined at the synthesis stage. This paper discusses a more consistent PoE framework where the models are jointly trained. A training algorithm for PoEs based on linear feature functions and Gaussian experts is derived by generalizing the training algorithm for trajectory HMMs. However for non-linear feature functions or non-Gaussian experts this is not possible, so a scheme based on contrastive divergence learning is described. Experimental results show that the PoE framework provides both a mathematically elegant way to train multiple acoustic models jointly and significant improvements in the quality of the synthesized speech.

[1]  J. Laurie Snell,et al.  Markov Random Fields and Their Applications , 1980 .

[2]  Sadaoki Furui,et al.  Speaker-independent isolated word recognition using dynamic features of speech spectrum , 1986, IEEE Trans. Acoust. Speech Signal Process..

[3]  Geoffrey E. Hinton,et al.  Adaptive Mixtures of Local Experts , 1991, Neural Computation.

[4]  Keiichi Tokuda,et al.  An adaptive algorithm for mel-cepstral analysis of speech , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Jj Odell,et al.  The Use of Context in Large Vocabulary Speech Recognition , 1995 .

[6]  Mari Ostendorf,et al.  From HMM's to segment models: a unified view of stochastic modeling for speech recognition , 1996, IEEE Trans. Speech Audio Process..

[7]  Keiichi Tokuda,et al.  Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis , 1999, EUROSPEECH.

[8]  Geoffrey E. Hinton Products of experts , 1999 .

[9]  Keiichi Tokuda,et al.  Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[10]  Christopher K. I. Williams,et al.  Products of Gaussians , 2001, NIPS.

[11]  Sridha Sridharan,et al.  Trainable speech synthesis with trended hidden Markov models , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[12]  Geoffrey E. Hinton Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.

[13]  Geoffrey E. Hinton,et al.  Wormholes Improve Contrastive Divergence , 2003, NIPS.

[14]  Yee Whye Teh,et al.  Energy-Based Models for Sparse Overcomplete Representations , 2003, J. Mach. Learn. Res..

[15]  Mark J. F. Gales,et al.  Basis superposition precision matrix modelling for large vocabulary continuous speech recognition , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[16]  Christopher K. I. Williams How to Pretend That Correlated Variables Are Independent by Using Difference Observations , 2005, Neural Computation.

[17]  Philipp Slusallek,et al.  Introduction to real-time ray tracing , 2005, SIGGRAPH Courses.

[18]  Leonhard Held,et al.  Gaussian Markov Random Fields: Theory and Applications , 2005 .

[19]  Heiga Zen,et al.  Estimating Trajectory Hmm Parameters Using Monte Carlo Em With Gibbs Sampler , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[20]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[21]  Heiga Zen,et al.  Speaker adaptation of trajectory HMMs using feature-space MLLR , 2006, INTERSPEECH.

[22]  Ren-Hua Wang,et al.  USTC System for Blizzard Challenge 2006 an Improved HMM-based Speech Synthesis Method , 2006 .

[23]  Mark J. F. Gales,et al.  Product of Gaussians for speech recognition , 2006, Comput. Speech Lang..

[24]  全 炳河,et al.  Reformulating HMM as a trajectory model by imposing explicit relationships between static and dynamic features , 2006 .

[25]  Heiga Zen,et al.  Model-space MLLR for trajectory HMMs , 2007, INTERSPEECH.

[26]  Max Welling Donald,et al.  Products of Experts , 2007 .

[27]  Max Welling,et al.  Product of experts , 2007, Scholarpedia.

[28]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[29]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[30]  Heiga Zen,et al.  Details of the Nitech HMM-Based Speech Synthesis System for the Blizzard Challenge 2005 , 2007, IEICE Trans. Inf. Syst..

[31]  Keiichi Tokuda,et al.  A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[32]  Heiga Zen,et al.  A Hidden Semi-Markov Model-Based Speech Synthesis System , 2007, IEICE Trans. Inf. Syst..

[33]  Heiga Zen,et al.  Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences , 2007, Comput. Speech Lang..

[34]  Masami Akamine,et al.  Multilevel parametric-base F0 model for speech synthesis , 2008, INTERSPEECH.

[35]  Li-Rong Dai,et al.  Multi-Layer F0 Modeling for HMM-Based Speech Synthesis , 2008, 2008 6th International Symposium on Chinese Spoken Language Processing.

[36]  Heiga Zen,et al.  The HTS-2008 System: Yet Another Evaluation of the Speaker-Adaptive HMM-based Speech Synthesis System in The 2008 Blizzard Challenge , 2008 .

[37]  Heiga Zen,et al.  Performance evaluation of the speaker-independent HMM-based speech synthesis system “HTS 2007” for the Blizzard Challenge 2007 , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[38]  William J. Byrne,et al.  Autoregressive HMMs for speech synthesis , 2009, INTERSPEECH.

[39]  Keiichi Tokuda,et al.  Full covariance state duration modeling for HMM-based speech synthesis , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[40]  Feng Ding,et al.  A polynomial segment model based statistical parametric speech synthesis sytem , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[41]  Heiga Zen,et al.  Statistical parametric speech synthesis based on product of experts , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[42]  Oliver Watts,et al.  The CSTR/EMIME HTS system for Blizzard Challenge 2010 , 2010 .

[43]  Daphne Koller,et al.  Non-Local Contrastive Objectives , 2010, ICML.

[44]  Radford M. Neal Probabilistic Inference Using Markov Chain Monte Carlo Methods , 2011 .

[45]  Heiga Zen,et al.  Estimation of Window Coefficients for Dynamic Feature Extraction for HMM-Based Speech Synthesis , 2011, INTERSPEECH.

[46]  Zhizheng Wu,et al.  Improved Prosody Generation by Maximizing Joint Probability of State and Longer Units , 2011, IEEE Transactions on Audio, Speech, and Language Processing.