The IBM expressive text-to-speech synthesis system for American English

Expressive text-to-speech (TTS) synthesis should contribute to the pleasantness, intelligibility, and speed of speech-based human-machine interactions that use TTS. We describe a TTS engine that can be directed, via text markup, to use a variety of expressive styles: here, questioning, contrastive emphasis, and conveying good and bad news. Differences among these styles lead us to investigate two approaches to expressive TTS, a "corpus-driven" approach and a "prosodic-phonology" approach. Each speaker records 11 hours (excluding silences) of "neutral" sentences. In the corpus-driven approach, the speaker also records a 1-hour corpus in each expressive style; these segments are tagged by style for use during search, and decision trees for determining f0 contours and timing are trained separately on the neutral corpus and on each expressive corpus. In the prosodic-phonology approach, rules translating certain expressive markup elements to tones and break indices (ToBI) are manually determined, and the ToBI elements are used in single f0 and duration trees covering all expressions. Listening tests show that listeners identify synthesis in particular styles at rates ranging from 70% correct for "conveying bad news" to 85% for "yes-no questions". Further improvements are demonstrated through the use of speaker-pooled f0 and duration models.
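The prosodic-phonology approach described above can be sketched as a rule table mapping expressive markup tags to ToBI elements. This is a minimal illustrative sketch only: the style names, the specific ToBI mappings, and the fallback behavior are all assumptions for illustration, not the paper's actual rule set.

```python
# Illustrative sketch of the "prosodic-phonology" approach: hand-written
# rules translate expressive markup elements into ToBI tones and break
# indices, which then feed shared f0 and duration models.
# All mappings below are hypothetical examples, not the system's rules.

TOBI_RULES = {
    "yes-no-question": {"phrase_accent": "H-", "boundary_tone": "H%"},
    "contrastive-emphasis": {"pitch_accent": "L+H*", "break_index": 3},
    "neutral-declarative": {"phrase_accent": "L-", "boundary_tone": "L%"},
}

def markup_to_tobi(style: str) -> dict:
    """Map an expressive-markup style tag to ToBI elements (illustrative)."""
    # Unknown styles fall back to neutral prosody (an assumed policy).
    return TOBI_RULES.get(style, TOBI_RULES["neutral-declarative"])
```

Because the ToBI elements, not the styles themselves, condition the f0 and duration trees, a single pair of trees can serve all expressions, in contrast to the corpus-driven approach's per-style models.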
