A novel unit selection method for concatenation speech system using similarity measure

This paper presents a new approach to unit selection for a corpus-based TTS system, in which units are selected according to their similarity to a synthetic target generated by a parametric synthesizer. In the training stage, a group of classifiers is trained on human perceptual judgments. The classifiers' outputs, rather than a traditional continuously valued cost, are then used to distinguish between candidate units. To improve classification accuracy, different combinations of features are tried as input vectors, and the similarity rating is carried out carefully. Subjective listening tests on a female-voice Mandarin TTS system show that the proposed classifier-based speech synthesis system outperforms the traditional unit-selection system.
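The selection idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the classifiers or the features, so the sketch assumes a single Euclidean distance between feature vectors and a one-parameter threshold classifier fitted to binary "similar / not similar" listener labels. The names `distance`, `train_threshold`, and `select_units` are hypothetical.

```python
from typing import List, Sequence, Tuple


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors (assumed feature space)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def train_threshold(pairs: List[Tuple[float, bool]]) -> float:
    """Fit a toy classifier: choose the distance threshold that agrees with
    the most listener labels. Each pair is (distance, judged_similar)."""
    best_t, best_correct = 0.0, -1
    for t in sorted(d for d, _ in pairs):
        # Count labels that the rule "distance <= t means similar" reproduces.
        correct = sum((d <= t) == similar for d, similar in pairs)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t


def select_units(
    target: Sequence[float],
    candidates: List[Sequence[float]],
    threshold: float,
) -> List[int]:
    """Return indices of candidate units classified as similar to the
    synthetic target, instead of ranking them by a continuous cost."""
    return [i for i, c in enumerate(candidates) if distance(target, c) <= threshold]


# Hypothetical perceptual data: (distance to target, listener said "similar").
labels = [(0.1, True), (0.3, True), (0.8, False), (1.2, False)]
threshold = train_threshold(labels)

# Hypothetical synthetic target and candidate units from the corpus.
target = [0.0, 0.0]
candidates = [[0.1, 0.0], [1.0, 1.0], [0.0, 0.2]]
selected = select_units(target, candidates, threshold)
```

In this toy setting the classifier makes a binary accept/reject decision per candidate. A real system would presumably use richer features and a trained classifier in place of the single threshold.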
