Natural speech synthesis based on hybrid approach with candidate expansion and verification

This research investigates a hybrid Mandarin speech synthesis system that combines concatenation-based and model-based methods. To exploit a small corpus effectively, the candidate sets for unit selection are expanded via clusters based on articulatory features (AFs), which are estimated as the outputs of an artificial neural network. A filtering operation incorporating residual compensation then removes unsuitable units. Given an input text, an optimal unit sequence is determined by minimizing a total cost that depends on spectral features, contextual articulatory features, formants, and pitch values. Furthermore, prosodic word verification is integrated to check the smoothness of the output speech; units that fail this verification are replaced by model-based synthesized units for better speech quality. Objective and subjective evaluations were conducted. Comparisons among the proposed method, the HMM-based method, and the conventional hybrid method clearly show that candidate set expansion based on articulatory features leads to more units suitable for selection, and that the verification process effectively improves the naturalness of the output speech.
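The unit-selection step described above, choosing the unit sequence that minimizes a total cost combining target-feature terms (spectral, articulatory, formant, pitch) with concatenation smoothness, is commonly implemented as a dynamic-programming (Viterbi) search over per-position candidate sets. The sketch below illustrates that general scheme only; the Euclidean distance, equal weights, and all names are illustrative assumptions, not the paper's actual cost definitions:

```python
import math

def select_units(targets, candidates, w_target=1.0, w_join=1.0):
    """Pick one candidate per position minimizing the total cost:
    sum of target costs plus join (concatenation) costs, via DP.

    targets:    list of target feature vectors (tuples of floats)
    candidates: per position, a list of (unit_id, feature_vector) pairs
    Returns (selected unit ids, total cost of that sequence).
    """
    def dist(a, b):  # illustrative distance; real systems weight features
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    n = len(targets)
    # best[i][j]: minimal cumulative cost ending in candidate j at position i
    best = [[w_target * dist(f, targets[0]) for _, f in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, n):
        row, brow = [], []
        for _, f in candidates[i]:
            tc = w_target * dist(f, targets[i])
            # cheapest predecessor, accounting for the join cost to it
            costs = [best[i - 1][k] + w_join * dist(candidates[i - 1][k][1], f)
                     for k in range(len(candidates[i - 1]))]
            k = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k] + tc)
            brow.append(k)
        best.append(row)
        back.append(brow)
    # backtrack from the cheapest final candidate
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j][0] for i, j in enumerate(path)], total

# Toy example: 1-D "pitch" targets and two candidates per position.
targets = [(100.0,), (110.0,)]
candidates = [[("a1", (98.0,)), ("a2", (140.0,))],
              [("b1", (112.0,)), ("b2", (60.0,))]]
units, cost = select_units(targets, candidates)
# → (['a1', 'b1'], 18.0): close targets and a smooth join win
```

Expanding the candidate sets via AF-based clusters, as the paper proposes, simply enlarges the per-position lists searched here; the verification stage would then inspect the selected sequence and swap rejected units for model-based ones.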
