Parameter Generation Methods With Rich Context Models for High-Quality and Flexible Text-to-Speech Synthesis

In this paper, we propose parameter generation methods using rich context models as yet another hybrid method combining hidden Markov model (HMM)-based speech synthesis and unit selection synthesis. Traditional HMM-based speech synthesis enables flexible modeling of acoustic features based on a statistical approach; however, the generated speech parameters tend to be excessively smoothed. To address this problem, several hybrid methods combining HMM-based speech synthesis and unit selection synthesis have been proposed. Although they significantly improve the quality of synthetic speech, they usually sacrifice the flexibility of the original HMM-based speech synthesis. In the proposed methods, we use rich context models, which are statistical models that represent individual acoustic parameter segments. In training, the rich context models are reformulated as Gaussian mixture models (GMMs). In synthesis, initial speech parameters are generated from probability distributions over-fitted to individual segments, and the speech parameter sequence is then iteratively generated from the GMMs using a parameter generation method based on the maximum likelihood criterion. Since the basic framework of the proposed methods remains the same as the traditional one, the capability of flexibly modeling acoustic features is preserved. The experimental results demonstrate that: (1) approximation with a single Gaussian component sequence yields better synthetic speech quality than the EM algorithm in the proposed parameter generation method; (2) state-based model selection yields quality improvements at the same level as frame-based model selection; (3) initializing with parameters generated from the over-fitted speech probability distributions is highly effective for further improving speech quality; and (4) the proposed methods for the spectral and F0 components yield significant improvements in synthetic speech quality compared with traditional HMM-based speech synthesis.
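
The maximum-likelihood parameter generation step described above can be sketched for a single feature dimension with diagonal covariances. This is a minimal illustrative sketch, not the paper's implementation: the function name `mlpg`, the static-plus-delta feature layout, and the delta window (-0.5, 0, 0.5) are assumptions made here for concreteness. The core idea, solving W'Σ⁻¹W c = W'Σ⁻¹μ for the static trajectory c, follows the standard speech parameter generation formulation.

```python
import numpy as np

def mlpg(means, variances, delta_win=(-0.5, 0.0, 0.5)):
    """Maximum-likelihood parameter generation for one feature
    dimension. `means` and `variances` are (T, 2) arrays of per-frame
    Gaussian statistics for [static, delta] features. Returns the
    static trajectory c maximizing the likelihood of o = W c."""
    T = means.shape[0]
    # W maps the static trajectory c (length T) to observations o
    # (length 2T): rows 2t copy the static value, rows 2t+1 apply
    # the delta window across neighboring frames.
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        for j, w in enumerate(delta_win):
            tau = t + j - 1
            if 0 <= tau < T:
                W[2 * t + 1, tau] += w
    mu = means.reshape(-1)                # interleaved [s0, d0, s1, d1, ...]
    prec = 1.0 / variances.reshape(-1)    # diagonal precision
    A = W.T @ (prec[:, None] * W)         # W' Sigma^-1 W
    b = W.T @ (prec * mu)                 # W' Sigma^-1 mu
    return np.linalg.solve(A, b)
```

With a very weak delta constraint the solution simply tracks the static means; tightening the delta variances pulls the trajectory toward smooth transitions, which is exactly the smoothing behavior the abstract's over-fitted distributions are designed to counteract.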
