An improved minimum generation error based model adaptation for HMM-based speech synthesis

A minimum generation error (MGE) criterion had been proposed for model training in HMM-based speech synthesis. In this paper, we apply the MGE criterion to model adaptation for HMM-based speech synthesis, and introduce an MGE linear regression (MGELR) based model adaptation algorithm, where the regression matrices used to transform source models are optimized so as to minimize the generation errors of adaptation data. In addition, we incorporate the recent improvements of MGE criterion into MGELR-based model adaptation, including state alignment under MGE criterion and using a log spectral distortion (LSD) instead of Euclidean distance for spectral distortion measure. From the experimental results, the adaptation performance was improved after incorporating these two techniques, and the formal listening tests showed that the quality and speaker similarity of synthesized speech after MGELRbased adaptation were significantly improved over the original MLLR-based adaptation. Index Terms: Speech synthesis, HMM, speaker adaptation, minimum generation error, linear regression

[1]  Keiichi Tokuda,et al.  Hidden Markov models based on multi-space probability distribution for pitch pattern modeling , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[2]  Heiga Zen,et al.  Speaker-Independent HMM-based Speech Synthesis System: HTS-2007 System for the Blizzard Challenge 2007 , 2007 .

[3]  Keiichi Tokuda,et al.  Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis , 1999, EUROSPEECH.

[4]  Shun-ichi Amari,et al.  A Theory of Adaptive Pattern Classifiers , 1967, IEEE Trans. Electron. Comput..

[5]  Keiichi Tokuda,et al.  Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis , 2008, INTERSPEECH.

[6]  Ren-Hua Wang,et al.  Minimum Generation Error Training for HMM-Based Speech Synthesis , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[7]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[8]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[9]  Takao Kobayashi,et al.  HSMM-Based Model Adaptation Algorithms for Average-Voice-Based Speech Synthesis , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[10]  Lirong Dai,et al.  MINIMUM GENERATION ERROR LINEAR REGRESSION BASED MODEL ADAPTATION FOR HMM-BASED SPEECH SYNTHESIS , 2008 .

[11]  K. Tokuda,et al.  A Training Method of Average Voice Model for HMM-Based Speech Synthesis , 2003, IEICE Trans. Fundam. Electron. Commun. Comput. Sci..

[12]  Mark J. F. Gales,et al.  Maximum likelihood linear transformations for HMM-based speech recognition , 1998, Comput. Speech Lang..

[13]  Ren-Hua Wang,et al.  Improving the performance of HMM-based voice conversion using context clustering decision tree and appropriate regression matrix format , 2006, INTERSPEECH.

[14]  Keiichi Tokuda,et al.  Speaker adaptation for HMM-based speech synthesis system using MLLR , 1998, SSW.

[15]  F. Itakura Line spectrum representation of linear predictor coefficients of speech signals , 1975 .