Paper information - Statistical parametric speech synthesis based on product of experts

Statistical parametric speech synthesis based on product of experts

Multiple-level acoustic models (AMs) are often combined in statistical parametric speech synthesis. Both linear and non-linear functions of the observation sequence are used as features in these AMs. This combination of multiple-level AMs can be expressed as a product of experts (PoE): the likelihoods from the AMs are scaled, multiplied together, and then normalized. Currently, these multiple-level AMs are trained individually and combined only at the synthesis stage. This paper discusses a more consistent PoE framework in which the AMs are jointly trained. A generalization of trajectory HMM training can be used for multiple-level Gaussian AMs based on linear functions. For the non-linear case, however, this is not possible, so a scheme based on contrastive divergence learning is described. Experimental results show that the proposed technique provides both a mathematically elegant way to train multiple-level AMs and statistically significant improvements in the quality of synthesized speech.
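The "scaled, multiplied together and then normalized" combination can be illustrated with a toy sketch. It is not the paper's acoustic models: it assumes one-dimensional Gaussian experts acting directly on a scalar observation, and the names `poe_gaussian` and `poe_numeric` are hypothetical. For Gaussian experts, the normalized product is again Gaussian, with precision equal to the weighted sum of the expert precisions, so the closed form can be checked against a brute-force normalization on a grid.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def poe_gaussian(means, variances, weights):
    """Closed-form product of scaled 1-D Gaussian experts (toy assumption).

    Each expert density is raised to its weight, the results are multiplied,
    and the product is renormalized. The result is a Gaussian whose precision
    is the weighted sum of the expert precisions.
    """
    precision = sum(w / v for w, v in zip(weights, variances))
    mean = sum(w * m / v for w, m, v in zip(weights, means, variances)) / precision
    return mean, 1.0 / precision


def poe_numeric(x, means, variances, weights, grid):
    """Scale, multiply, and normalize numerically on a uniform grid."""
    def unnormalized(t):
        return math.prod(
            gaussian_pdf(t, m, v) ** w for m, v, w in zip(means, variances, weights)
        )

    step = grid[1] - grid[0]
    # Riemann-sum approximation of the normalization constant.
    z = sum(unnormalized(t) for t in grid) * step
    return unnormalized(x) / z


if __name__ == "__main__":
    means, variances, weights = [0.0, 2.0], [1.0, 1.0], [1.0, 1.0]
    mean, var = poe_gaussian(means, variances, weights)
    grid = [-10.0 + 0.01 * i for i in range(2001)]
    print(mean, var)
    print(poe_numeric(1.0, means, variances, weights, grid), gaussian_pdf(1.0, mean, var))
```

In this toy setting the normalization constant is available in closed form. For experts built on non-linear functions of the observation sequence, the abstract notes that this is no longer possible, which is what motivates the contrastive divergence scheme.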
