Improved average-voice-based speech synthesis using gender-mixed modeling and a parameter generation algorithm considering GV

To construct a speech synthesis system capable of producing diverse voices, we have been developing a speaker-independent approach to HMM-based speech synthesis in which a statistical average voice model is adapted to a target speaker using a small amount of speech data. In this paper, we incorporate the high-quality vocoding method STRAIGHT and a parameter generation algorithm that takes global variance (GV) into account, in order to improve the quality of the synthetic speech. Furthermore, we introduce a feature-space speaker adaptive training algorithm and a gender-mixed modeling technique to further normalize the average voice model. We build an English text-to-speech system using these techniques and evaluate its performance.

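As a rough illustration of the techniques named above, the sketch below shows the commonly used form of a GV-constrained parameter generation criterion and the constrained linear feature transform that underlies feature-space speaker adaptive training. The notation, the weight \omega, and the exact parameterization are assumptions for illustration, not taken from this paper.

% Sketch of a GV-based generation criterion (notation assumed for illustration).
% The static feature trajectory c = [c_1^T, ..., c_T^T]^T is generated by maximizing
% a weighted combination of the HMM log-likelihood of O = Wc (statics plus dynamics)
% and the log-likelihood of its per-utterance variance v(c) under a Gaussian GV model,
% rather than the HMM likelihood alone.
\begin{align}
  \hat{c} &= \arg\max_{c}\;
             \omega \log P\!\left(Wc \mid q, \lambda\right)
             + \log \mathcal{N}\!\left(v(c) \mid \mu_v, \Sigma_v\right), \\
  v_d(c)  &= \frac{1}{T}\sum_{t=1}^{T}\bigl(c_t(d) - \bar{c}(d)\bigr)^2,
  \qquad
  \bar{c}(d) = \frac{1}{T}\sum_{t=1}^{T} c_t(d),
\end{align}
% where W appends delta and delta-delta features, q is the HMM state sequence,
% and omega balances the two terms. In feature-space speaker adaptive training,
% each training speaker's observation o_t is normalized toward the average voice
% model with a constrained (CMLLR-style) linear transform before re-estimation:
\begin{equation}
  \hat{o}_t = A^{(s)} o_t + b^{(s)},
\end{equation}
% with one transform (A^{(s)}, b^{(s)}) estimated per speaker s by maximum likelihood.

In this reading, GV-constrained generation counteracts the over-smoothing of generated trajectories, while the speaker transforms absorb speaker-specific variation so that the average voice model captures a more neutral, speaker-normalized distribution.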