MINIMUM GENERATION ERROR LINEAR REGRESSION BASED MODEL ADAPTATION FOR HMM-BASED SPEECH SYNTHESIS

Because maximum likelihood (ML) based training is inconsistent with how the models are used at synthesis time in HMM-based speech synthesis, a minimum generation error (MGE) criterion has been proposed for HMM training. This paper extends the MGE criterion to model adaptation for HMM-based speech synthesis. We propose an MGE linear regression (MGELR) based model adaptation algorithm, in which the regression matrices that transform source models into target models are optimized to minimize the generation error on the input speech data uttered by the target speaker. The proposed MGELR approach was compared with maximum likelihood linear regression (MLLR) based model adaptation. Experimental results indicate that generation errors were reduced after MGELR-based model adaptation. In the subjective listening test, the discrimination and quality of the speech synthesized with MGELR were better than those obtained with MLLR.

Index Terms: speech synthesis, minimum generation error, linear regression, model adaptation
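The idea of optimizing a regression matrix to reduce generation error can be illustrated with a toy sketch. It is not the algorithm of this paper: it assumes static features only, treats the "generated" frame as simply the transformed state mean (no dynamic-feature parameter generation), uses squared Euclidean distance as the generation error, and updates one shared matrix by plain gradient descent. The function names, learning rate, and data are hypothetical.

```python
# Toy illustration only: static features, one shared regression matrix,
# and the generated frame is taken to be the transformed state mean.

def transform(W, mu):
    """Apply an affine regression matrix W (d x (d+1)) to a source mean mu."""
    xi = [1.0] + list(mu)  # extended mean vector [1, mu]
    return [sum(w * x for w, x in zip(row, xi)) for row in W]

def generation_error(W, source_means, target_frames):
    """Sum of squared distances between target frames and transformed means."""
    total = 0.0
    for mu, c in zip(source_means, target_frames):
        c_hat = transform(W, mu)
        total += sum((ci - hi) ** 2 for ci, hi in zip(c, c_hat))
    return total

def adapt(source_means, target_frames, lr=0.05, epochs=200):
    """Gradient descent on W, starting from the identity transform."""
    d = len(source_means[0])
    W = [[0.0] + [1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(epochs):
        for mu, c in zip(source_means, target_frames):
            xi = [1.0] + list(mu)
            c_hat = transform(W, mu)
            for i in range(d):
                err = c[i] - c_hat[i]
                for j in range(d + 1):
                    # Negative gradient of (c_i - W_i . xi)^2 w.r.t. W[i][j]
                    W[i][j] += lr * 2.0 * err * xi[j]
    return W

if __name__ == "__main__":
    # Hypothetical source state means and aligned target-speaker frames.
    means = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    frames = [[0.5, -0.3], [1.4, -0.3], [0.6, 0.9], [1.5, 0.9]]
    W = adapt(means, frames)
    print("error after adaptation:", generation_error(W, means, frames))
```

In this simplified setting the error is quadratic in the matrix entries, so gradient descent is only one of several ways to minimize it; it is used here because it mirrors the notion of iteratively optimizing the transform against a generation-error objective.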
