Information-theoretic criteria for unit selection synthesis

In our recent work on concatenative speech synthesis, we devised an efficient graph-based search that performs unit selection given symbolic information. Because concatenation and substitution costs are encapsulated at the class level, the graph grows only linearly with corpus size. To date, these costs have been manually tuned over pre-specified classes, a knowledge-intensive engineering process. In this paper, we turn to information-theoretic metrics for learning the costs automatically from data. These costs can be analyzed in a minimum description length (MDL) framework. The performance of the automatically determined weights is compared against that of manually tuned weights in a perceptual evaluation.
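As a rough illustration of how a class-level cost might be learned from data rather than tuned by hand, the sketch below scores the substitution of one context class for another by the symmetric Kullback-Leibler divergence between their empirical distributions. The abstract does not say which information-theoretic metric is used, so the choice of symmetric KL divergence, the add-one smoothing, the discrete acoustic cluster labels, and all function names here are assumptions made for illustration only.

```python
import math
from collections import Counter


def smoothed_distribution(observations, alphabet, alpha=1.0):
    """Empirical distribution over `alphabet` with additive smoothing.

    Smoothing keeps every probability non-zero so the divergence is finite.
    """
    counts = Counter(observations)
    total = len(observations) + alpha * len(alphabet)
    return {s: (counts[s] + alpha) / total for s in alphabet}


def kl_divergence(p, q):
    """D(p || q) in bits for two distributions over the same symbols."""
    return sum(p[s] * math.log2(p[s] / q[s]) for s in p)


def substitution_cost(obs_a, obs_b, alphabet):
    """Hypothetical substitution cost between two classes.

    Assumption: the cost is the symmetric KL divergence between the
    smoothed distributions of the labels observed in each class.
    """
    p = smoothed_distribution(obs_a, alphabet)
    q = smoothed_distribution(obs_b, alphabet)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


if __name__ == "__main__":
    # Toy data: acoustic cluster labels observed in two context classes.
    alphabet = ["c0", "c1", "c2"]
    class_a = ["c0", "c0", "c1"]
    class_b = ["c2", "c2", "c1"]
    print(substitution_cost(class_a, class_b, alphabet))
```

Under this sketch, classes whose units behave alike receive a low cost and are cheap to substitute, while dissimilar classes are penalized; the cost is zero when the two distributions coincide.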
