The USTC System for Blizzard Challenge 2010

This paper introduces the speech synthesis system developed by USTC for Blizzard Challenge 2010. USTC participated in all English tasks, including both the hub tasks and the spoke tasks. Because the conditions differ from task to task, different versions of the synthesis system were constructed. Many new techniques were employed in building these systems, and results of internal experiments comparing the techniques are presented and analyzed. The evaluation results of Blizzard Challenge 2010 show that our system performs well in naturalness and similarity, but the intelligibility of the synthetic speech is not yet good enough.
