Candidate Expansion and Prosody Adjustment for Natural Speech Synthesis Using a Small Corpus

This study proposes a hybrid approach to natural-sounding speech synthesis from a small corpus, combining candidate expansion, unit selection, and prosody adjustment. The method is tailored to tonal languages, in particular Mandarin. In conventional speech synthesis, the quality of the synthesized speech depends heavily on the size of the speech corpus, yet preparing a large labeled corpus is time-consuming and labor-intensive. In this work, candidate expansion retrieves potential candidates that are unlikely to be retrieved using linguistic features alone. The optimal unit sequence is then obtained from the expanded candidates by the proposed unit selection mechanism, which operates at the phoneme and prosodic-word levels. Finally, a prosodic-word-level prosody adjustment improves the continuity and smoothness of the prosody of the synthesized speech. The proposed method was evaluated on the Tsing-Hua Corpus of Speech Synthesis (TH-CoSS). An objective evaluation demonstrates the effectiveness of candidate expansion and the improved continuity and smoothness of the prosody. A subjective evaluation also shows that the proposed system synthesizes speech with improved quality and naturalness, particularly when the corpus is small or resource-limited.
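The abstract does not give the cost functions used in the unit selection step, so the following is only a minimal sketch of generic unit selection by dynamic programming, not the authors' mechanism. It assumes each position in the utterance has a list of candidate units (for example, after candidate expansion), a target cost that scores how well a unit fits its position, and a join cost that scores the discontinuity between adjacent units. The function name `select_units`, the toy pitch-based costs, and all numeric values are hypothetical and chosen purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Unit = TypeVar("Unit")


def select_units(
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[int, Unit], float],
    join_cost: Callable[[Unit, Unit], float],
) -> Tuple[List[Unit], float]:
    """Return the candidate sequence with the lowest total cost.

    Generic Viterbi-style search: total cost is the sum of per-unit
    target costs plus join costs between consecutive units.
    Each position is assumed to have at least one candidate.
    """
    if not candidates:
        return [], 0.0

    # best[i][j] = (cumulative cost of the best path ending in
    #               candidates[i][j], index of its predecessor)
    best = [[(target_cost(0, u), -1) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + join_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, u), back))
        best.append(row)

    # Backtrace from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path: List[Unit] = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy example (hypothetical data): each unit is (start F0, end F0) in Hz.
toy_candidates = [
    [(100, 110), (120, 130)],
    [(112, 118), (131, 140)],
    [(119, 121), (150, 160)],
]
toy_targets = [105, 115, 120]  # assumed desired mean F0 per position


def toy_target_cost(i: int, u: Tuple[int, int]) -> float:
    # Distance between the unit's mean F0 and the desired mean F0.
    return abs((u[0] + u[1]) / 2 - toy_targets[i])


def toy_join_cost(prev: Tuple[int, int], cur: Tuple[int, int]) -> float:
    # F0 jump at the concatenation point.
    return abs(prev[1] - cur[0])


path, total = select_units(toy_candidates, toy_target_cost, toy_join_cost)
```

In this toy setting the search prefers the sequence whose pitch contour is both close to the targets and smooth across joins. A larger candidate list per position, as candidate expansion would provide, simply widens each row of the search.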
