Paper information - Expressive speech synthesis using American English ToBI: questions and contrastive emphasis

We describe American English concatenative text-to-speech synthesis experiments in which "expressions", namely questioning and contrastive emphasis, are each associated with a ToBI prosodic template. ToBI labels, along with text features, are in turn incorporated into decision-tree models of F0 and segment duration to be used during synthesis, sparing the need for expression-specific large corpora and decision trees. Synthesizing using this approach enables listeners to perform the difficult task of distinguishing yes-no questions from identically-worded declarative sentences 78% of the time, compared to the baseline system's 50%. For contrastive emphasis, a sentence is synthesized with emphasis on a word which is chosen appropriately or inappropriately based on a preceding sentence. Listeners' mean opinion scores for appropriate emphases exceed inappropriate by 0.40 on a 1-to-5 scale for the experimental system, compared to a difference of 0.11 for the baseline, a significant system difference (p<0.01).

J. F. Pitrelli | E. M. Eide
