Local minimum generation error criterion for hybrid HMM speech synthesis

This paper presents an HMM-driven hybrid speech synthesis approach in which unit selection concatenative synthesis is used to improve the quality of the statistical system using a Local Minimum Generation Error (LMGE) during the synthesis stage. The idea behind this approach is to combine the robustness due to HMMs with the naturalness of concatenated units. Unlike the conventional hybrid approaches to speech synthesis that use concatenative synthesis as a backbone, the proposed system employs stable regions of natural units to improve the statistically generated parameters. We show that this approach improves the generation of vocal tract parameters, smoothes the bad joints and increases the overall quality.

[1]  Wu Guo,et al.  Minimum generation error criterion for tree-based clustering of context dependent HMMs , 2006, INTERSPEECH.

[2]  Vincent Pollet,et al.  Synthesis by generation and concatenation of multiform segments , 2008, INTERSPEECH.

[3]  Keiichi Tokuda,et al.  XIMERA: a new TTS from ATR based on corpus-based technologies , 2004, SSW.

[4]  Michael W. Macon,et al.  Unit fusion for concatenative speech synthesis , 2000, INTERSPEECH.

[5]  Alex Acero,et al.  HMM-based smoothing for concatenative speech synthesis , 1998, ICSLP.

[6]  Ren-Hua Wang,et al.  Minimum unit selection error training for HMM-based unit selection speech synthesis system , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7]  Soufiane Rouibia,et al.  Unit selection for speech synthesis based on a new acoustic target cost , 2005, INTERSPEECH.

[8]  Heiga Zen,et al.  Details of the Nitech HMM-Based Speech Synthesis System for the Blizzard Challenge 2005 , 2007, IEICE Trans. Inf. Syst..

[9]  Keiichi Tokuda,et al.  A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis , 2007, IEICE Trans. Inf. Syst..

[10]  Heiga Zen,et al.  Hidden Semi-Markov Model Based Speech Synthesis System , 2006 .

[11]  Heng Lu,et al.  The USTC and iFlytek Speech Synthesis Systems for Blizzard Challenge 2007 , 2007 .

[12]  Junichi Yamagishi,et al.  Utilization of an HMM-based feature generation module in 5 ms segment concatenative speech synthesis , 2007, SSW.

[13]  Heiga Zen,et al.  Speaker-Independent HMM-based Speech Synthesis System , 2007 .

[14]  Heiga Zen,et al.  Statistical Parametric Speech Synthesis , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[15]  Hideki Kawahara,et al.  Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT , 2001, MAVEBA.

[16]  Tetsunori Kobayashi,et al.  Hybrid Voice Conversion of Unit Selection and Generation Using Prosody Dependent HMM , 2006, IEICE Trans. Inf. Syst..