Mutual-information based segment pre-selection in concatenative text-to-speech

Corpus-based concatenative text-to-speech (CTTS) systems have proven successful at producing speech with good voice quality. However, they require a large inventory of synthesis segments and complex search algorithms, which can hinder the usability of CTTS. Segment pre-selection prunes the candidate segments so as to achieve the best possible synthesis quality within a pre-defined inventory size, making CTTS usable in environments where memory and CPU are critically constrained. This paper presents a novel pre-selection method that integrates mutual information (MI), a well-known concept in statistics. Objective and subjective evaluations of the synthesized speech show that the new approach outperforms two conventional pre-selection methods widely used in current CTTS systems.
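
The abstract does not spell out how MI enters the pre-selection step. For orientation only, the standard definition for two discrete variables is I(X;Y) = sum over (x, y) of p(x, y) log [p(x, y) / (p(x) p(y))], which can be estimated from co-occurrence counts. The sketch below is a minimal illustration of that estimate; the function name `mutual_information` and the idea of feeding it (x, y) observation pairs are assumptions made here, not the paper's actual formulation.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Estimate I(X;Y) in bits from an iterable of (x, y) observations.

    Hypothetical helper for illustration; not taken from the paper.
    """
    pairs = list(pairs)
    n = len(pairs)
    if n == 0:
        return 0.0

    joint = Counter(pairs)                 # counts of each (x, y) pair
    px = Counter(x for x, _ in pairs)      # marginal counts of x
    py = Counter(y for _, y in pairs)      # marginal counts of y

    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi
```

In a pre-selection setting, one could imagine X being a segment identity and Y a context label, with higher MI suggesting that the segment is more informative about where it is used; whether the paper scores segments this way is not stated in the abstract.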
