Statistical Modeling for Unit Selection in Speech Synthesis

Traditional concatenative speech synthesis systems use a number of heuristics to define the target and concatenation costs, which are essential to the design of the unit selection component. In contrast to these approaches, we introduce a general statistical modeling framework for unit selection inspired by automatic speech recognition. Given appropriate data, techniques based on this framework can yield more accurate unit selection, thereby improving the overall quality of a speech synthesizer. They can also lead to a more modular and substantially more efficient system.

We present a new unit selection system based on statistical modeling. To overcome the initial absence of data, we use an existing high-quality unit selection system to generate a corpus of unit sequences. We show that the concatenation cost can be accurately estimated from this corpus using a statistical n-gram language model over units. We use weighted automata and transducers to represent the components of the system, and we design a new and more efficient composition algorithm that uses string potentials to combine them. The resulting statistical unit selection is shown to be about 2.6 times faster than the last release of the AT&T Natural Voices Product while preserving the same quality, and it offers considerable flexibility for the use and integration of new and more complex components.
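As a rough illustration of estimating a concatenation cost from a corpus of unit sequences, the sketch below trains a bigram model over unit identifiers and uses the negative log conditional probability as the cost of joining two units. The unit names, the toy corpus, the function name `train_bigram_costs`, and the add-alpha smoothing are all assumptions made for this example; the system described above may use a different n-gram order and smoothing scheme.

```python
import math
from collections import Counter


def train_bigram_costs(corpus, alpha=1.0):
    """Return cost(prev, cur) = -log P(cur | prev) from unit sequences.

    Uses add-alpha smoothing (an assumption for this sketch) so that
    unseen unit pairs get a finite, but high, cost.
    """
    vocab = {unit for seq in corpus for unit in seq}
    context_counts = Counter()  # how often each unit appears as a left context
    pair_counts = Counter()     # how often each (prev, cur) pair appears

    for seq in corpus:
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1

    vocab_size = len(vocab)

    def cost(prev, cur):
        prob = (pair_counts[(prev, cur)] + alpha) / (
            context_counts[prev] + alpha * vocab_size
        )
        return -math.log(prob)

    return cost, vocab


# Hypothetical corpus: each list is a unit sequence chosen by an existing system.
corpus = [
    ["u1", "u2", "u3"],
    ["u1", "u2", "u4"],
    ["u5", "u2", "u3"],
]
cost, vocab = train_bigram_costs(corpus)

# A frequently observed join is cheaper than one never seen in the corpus.
seen = cost("u1", "u2")
unseen = cost("u1", "u3")
```

Because the cost is a negative log probability, summing costs along a candidate unit sequence corresponds to multiplying the model's conditional probabilities, which is what makes such a model easy to combine with other weighted components.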
