Paper information: Product of Experts for Statistical Parametric Speech Synthesis

Multiple acoustic models are often combined in statistical parametric speech synthesis. Both linear and non-linear functions of an observation sequence are used as features to be modeled. This paper shows that this combination of multiple acoustic models can be expressed as a product of experts (PoE): the likelihoods from the models are scaled, multiplied together, and then normalized. Normally, these models are trained individually and combined only at the synthesis stage. This paper discusses a more consistent PoE framework in which the models are trained jointly. A training algorithm for PoEs based on linear feature functions and Gaussian experts is derived by generalizing the training algorithm for trajectory HMMs. However, this generalization is not possible for non-linear feature functions or non-Gaussian experts, so a scheme based on contrastive divergence learning is described for those cases. Experimental results show that the PoE framework provides both a mathematically elegant way to train multiple acoustic models jointly and significant improvements in the quality of the synthesized speech.
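The "scaled, multiplied together, and then normalized" combination can be illustrated with a minimal sketch. It is not the paper's implementation: the paper's experts model feature functions of a whole observation sequence, whereas this toy uses two one-dimensional Gaussian experts. The function names, the expert means and variances, and the weights are all hypothetical choices made for illustration. The sketch relies on the standard result that a renormalized product of Gaussians is again Gaussian, with the weighted precisions adding up, and checks that closed form against a brute-force scale, multiply, normalize computation on a grid.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def product_of_gaussians(means, variances, weights):
    """Closed-form mean and variance of the renormalized product
    prod_i N(x; m_i, v_i) ** w_i.

    The weighted precisions w_i / v_i add, and the mean is the
    precision-weighted average of the expert means.
    """
    precision = sum(w / v for w, v in zip(weights, variances))
    mean = sum(w * m / v for w, m, v in zip(weights, means, variances)) / precision
    return mean, 1.0 / precision


def poe_density_on_grid(grid, means, variances, weights):
    """Scale (exponent w_i), multiply, and normalize the expert densities
    numerically on a uniform grid."""
    unnormalized = []
    for x in grid:
        p = 1.0
        for m, v, w in zip(means, variances, weights):
            p *= gaussian_pdf(x, m, v) ** w
        unnormalized.append(p)
    step = grid[1] - grid[0]
    z = sum(unnormalized) * step  # Riemann-sum estimate of the normalizer
    return [p / z for p in unnormalized]


# Hypothetical toy experts (illustrative values, not taken from the paper).
means = [0.0, 2.0]
variances = [1.0, 4.0]
weights = [1.0, 1.0]

poe_mean, poe_var = product_of_gaussians(means, variances, weights)

grid = [-10.0 + 0.01 * i for i in range(2001)]
density = poe_density_on_grid(grid, means, variances, weights)
step = grid[1] - grid[0]
numeric_mean = sum(x * d for x, d in zip(grid, density)) * step

print(poe_mean, poe_var, numeric_mean)
```

With these toy values the combined expert is sharper than either input: its variance is smaller than both expert variances, and its mean sits closer to the more confident expert. For experts where no such closed form exists, the normalizer generally has to be approximated, which is the situation the abstract addresses with contrastive divergence learning.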
