A Hybrid Text-to-Speech System That Combines Concatenative and Statistical Synthesis Units

Concatenative synthesis and statistical synthesis are the two main approaches to text-to-speech (TTS) synthesis. Concatenative TTS (CTTS) stores segments of natural speech features, selected from a recorded speech database, and can therefore synthesize speech of natural quality. However, as the footprint of the stored data is reduced, the desired segments are not always available, and audible discontinuities may result. Statistical TTS (STTS) systems, despite having a smaller footprint than CTTS, synthesize speech that is free of such discontinuities. Yet STTS generally produces speech of lower naturalness than CTTS, often sounding muffled; this muffling effect is due to over-smoothing of the model-generated speech features. To gain the advantages of both approaches, we propose in this work to combine CTTS and STTS into a hybrid TTS (HTTS) system. Each utterance representation in HTTS is constructed from natural segments and model-generated segments in an interweaved fashion via a hybrid dynamic path algorithm. Reported listening tests demonstrate the validity of the proposed approach.
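The interweaving of natural and model-generated segments can be pictured as a dynamic-programming search over per-position candidate lists, in the spirit of classical unit selection: each position offers both CTTS and STTS candidates, and the path minimizing the sum of target and join costs is selected. The sketch below is an illustration only, not the authors' algorithm; the cost functions and the `("nat"/"gen", target_cost)` candidate encoding are assumptions made for the example.

```python
def hybrid_path(candidates, target_cost, join_cost):
    """Minimum-cost path through per-position candidate segments.

    candidates: list (one entry per target position) of candidate segments,
                each of which may be a natural or a model-generated unit.
    target_cost(c): fit of candidate c to its target position.
    join_cost(a, b): smoothness penalty for concatenating a then b.
    """
    n = len(candidates)
    cost = [[0.0] * len(candidates[i]) for i in range(n)]  # best cumulative cost
    back = [[-1] * len(candidates[i]) for i in range(n)]   # backpointers

    for j, c in enumerate(candidates[0]):
        cost[0][j] = target_cost(c)

    for i in range(1, n):
        for j, c in enumerate(candidates[i]):
            # cheapest predecessor, accounting for the join penalty
            best = min(
                range(len(candidates[i - 1])),
                key=lambda k: cost[i - 1][k] + join_cost(candidates[i - 1][k], c),
            )
            back[i][j] = best
            cost[i][j] = (cost[i - 1][best]
                          + join_cost(candidates[i - 1][best], c)
                          + target_cost(c))

    # trace back from the cheapest final candidate
    j = min(range(len(candidates[-1])), key=lambda k: cost[-1][k])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]


# Toy usage: each candidate is (kind, target_cost); switching between a
# natural and a generated unit incurs a small join penalty.
cands = [[("nat", 0.2), ("gen", 1.2)],
         [("nat", 5.0), ("gen", 0.5)]]
best = hybrid_path(cands,
                   target_cost=lambda c: c[1],
                   join_cost=lambda a, b: 0.3 if a[0] != b[0] else 0.0)
# A cheap natural unit is kept where available, and a generated unit is
# substituted where the natural candidate fits poorly.
```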
