Natural-sounding speech synthesis using variable-length units

The goal of this work was to develop a speech synthesis system which concatenates variable-length units to create naturalsounding speech. Our initial work in this area showed that by careful design of system responses to ensure consistent intonation contours, natural-sounding speech synthesis was achievable with wordand phrase-level concatenation. In order to extend the flexibility of this framework, we focused on the problem of generating novel words from a corpus of sub-word units. The design of the sub-word units was motivated by perceptual studies that investigated where speech could be spliced with minimal audible distortion and what contextual constraints were necessary to maintain in order to produce natural sounding speech. The sub-word corpus is searched during synthesis using a Viterbi search which selects a sequence of units based on how well they individually match the input specification and on how well they sound as an ensemble. This concatenative speech synthesis system, ENVOICE, has been used in a conversational information retrieval system in two application domains to convert meaning representations into speech waveforms.

[1]  Robert H. Kassel,et al.  A comparison of approaches to on-line handwritten character recognition , 1995 .

[2]  Frank K. Soong,et al.  A Tree.Trellis Based Fast Search for Finding the N Best Sentence Hypotheses in Continuous Speech Recognition , 1990, HLT.

[3]  Victor Zue,et al.  WHEELS: a conversational system in the automobile classifieds domain , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[4]  A. U.S.,et al.  COMBINATORIAL ISSUES IN TEXT-TO-SPEECH SYNTHESIS , 1997 .

[5]  James R. Glass,et al.  Multilingual language generation across multiple domains , 1994, ICSLP.

[6]  Alex Acero,et al.  Whistler: a trainable text-to-speech system , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[7]  Rpg Rene Collier,et al.  The generation of prosodic structure and intonation in speech synthesis , 1995 .

[8]  Eric Moulines,et al.  A diphone synthesis system based on time-domain prosodic modifications of speech , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[9]  M.G. Bellanger,et al.  Digital processing of speech signals , 1980, Proceedings of the IEEE.

[10]  Stefanie Shattuck-Hufnagel,et al.  A prosody tutorial for investigators of auditory sentence processing , 1996, Journal of psycholinguistic research.

[11]  Victor Zue,et al.  GALAXY: a human-language interface to on-line travel information , 1994, ICSLP.

[12]  James R. Glass,et al.  A probabilistic framework for feature-based speech recognition , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[13]  J. Flanagan Speech Analysis, Synthesis and Perception , 1971 .

[14]  Gérard Bailly,et al.  Talking Machines: Theories, Models, and Designs , 1992 .

[15]  Stephanie Seneff,et al.  ANGIE: a new framework for speech analysis based on morpho-phonological modelling , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[16]  Rob Kassel Automating the design of compact linguistic corpora , 1994, ICSLP.

[17]  Victor W. Zue,et al.  Acoustic Characteristics of Stop Consonants: A Controlled Study , 1976 .

[18]  Nick Campbell,et al.  Optimising selection of units from speech databases for concatenative synthesis , 1995, EUROSPEECH.

[19]  Y. Sagisaka,et al.  Speech synthesis by rule using an optimal selection of non-uniform synthesis units , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[20]  Stephanie Seneff,et al.  Hierarchical duration modelling for speech recognition using the ANGIE framework , 1997, EUROSPEECH.

[21]  James Glass,et al.  A Conversational Sys-tem in the Automobile Classi eds Domain , 1996 .

[22]  Alan W. Black,et al.  Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[23]  Michael K. McCandless,et al.  SAPPHIRE: an extensible speech analysis and recognition tool based on Tcl/Tk , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[24]  Eric Moulines,et al.  Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones , 1989, Speech Commun..

[25]  Eleonora Cavalcante Albano,et al.  Linguistic criteria for building and recording units for concatenative speech synthesis in brazilian portuguese , 1997, EUROSPEECH.

[26]  Victor Zue,et al.  From interface to content: translingual access and delivery of on-line information , 1997, EUROSPEECH.