Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters

This paper describes a novel framework for statistical parametric speech synthesis in which the speech waveform is modeled statistically through the joint estimation of acoustic and excitation model parameters. The proposed method combines the extraction of spectral parameters, which are treated as hidden variables, with excitation signal modeling, in a fashion similar to the factor analyzed trajectory hidden Markov model. The resulting joint model can be interpreted as waveform-level closed-loop training, in which the distance between natural and synthesized speech is minimized. An algorithm based on the maximum likelihood criterion is introduced to train the proposed joint model, and experiments are presented to show its effectiveness.
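
As a rough illustration of what treating the spectral parameters as hidden variables could mean under a maximum likelihood criterion, the joint model might be written as a marginal likelihood over the spectral parameter trajectory. The notation is assumed here for illustration and is not taken from the paper: s is the speech waveform, c the hidden spectral parameter sequence, \lambda_c the acoustic model parameters, and \lambda_e the excitation model parameters.

```latex
% Sketch only: all symbols are assumptions introduced for illustration.
% s         : observed speech waveform
% c         : spectral parameter sequence, treated as a hidden variable
% \lambda_c : acoustic model parameters
% \lambda_e : excitation model parameters
\begin{align}
  p(s \mid \lambda_c, \lambda_e)
    &= \int p(s \mid c, \lambda_e)\, p(c \mid \lambda_c)\, \mathrm{d}c, \\
  (\hat{\lambda}_c, \hat{\lambda}_e)
    &= \arg\max_{\lambda_c,\, \lambda_e}\; p(s \mid \lambda_c, \lambda_e).
\end{align}
```

Read this way, both parameter sets are scored directly against the waveform rather than against separately extracted spectral features, which is one reading of the closed-loop interpretation given above.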
