Paper information - A trial of communicative prosody generation based on control characteristic of one word utterance observed in real conversational speech

Aiming at prosody control for conversational speech synthesis, communicative prosodies were generated based on the prosodic characteristics derived from the one-word utterance “n”. Grouping F0 patterns by vector quantization (VQ) revealed four dynamic F0 patterns (rise, gradual fall, fall, and rise&fall) in a large number of one-word utterances “n” from daily conversations. Analysis using an F0 generation model found different control characteristics for these patterns. A communicative prosody control scheme for short utterances is proposed that reflects these control characteristics along three representative dimensions of perceptual impression (confident-doubtful, allowable-unacceptable, and positive-negative) previously obtained by multidimensional scaling (MDS) analysis. Naturalness evaluation tests on synthesized conversational speech showed that the proposed prosody control was superior in naturalness. These results indicate the possibility of communicative prosody generation for conversational speech synthesis through the expression of perceptual impressions using a corpus-based approach.
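The VQ grouping step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the contours were preprocessed or which VQ algorithm was used, so this example assumes fixed-length, time-normalized F0 contours and uses plain k-means with a farthest-point initialization. The function names (`toy_contour`, `vq_codebook`) and the synthetic contour shapes are hypothetical and serve only to show how four pattern classes could emerge from clustering.

```python
import math
import random


def toy_contour(kind, n=10, noise=0.0, rng=None):
    """Return a synthetic, time-normalized F0 contour (illustrative, not real data)."""
    rng = rng or random.Random(0)
    points = []
    for i in range(n):
        t = i / (n - 1)
        if kind == "rise":
            value = t
        elif kind == "gradual_fall":
            value = -0.3 * t
        elif kind == "fall":
            value = -t
        else:  # "rise_fall"
            value = math.sin(math.pi * t)
        points.append(value + rng.gauss(0.0, noise))
    return points


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length contours."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(vectors, k):
    """Pick k initial code vectors, each as far as possible from those already chosen."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        nxt = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        centers.append(list(nxt))
    return centers


def vq_codebook(vectors, k, max_iters=50):
    """Plain k-means VQ: returns (codebook, assignment of each vector to a code index)."""
    codebook = farthest_point_init(vectors, k)
    assign = None
    for _ in range(max_iters):
        new_assign = [
            min(range(k), key=lambda c: sq_dist(v, codebook[c])) for v in vectors
        ]
        if new_assign == assign:
            break  # assignments are stable, so the codebook has converged
        assign = new_assign
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old code vector if a cell is empty
                codebook[c] = [sum(col) / len(members) for col in zip(*members)]
    return codebook, assign


# Toy usage: five noisy contours per assumed pattern class.
rng = random.Random(0)
kinds = ["rise", "gradual_fall", "fall", "rise_fall"]
labels = [kind for kind in kinds for _ in range(5)]
contours = [toy_contour(kind, noise=0.05, rng=rng) for kind in labels]
codebook, assign = vq_codebook(contours, k=4)
for kind in kinds:
    print(kind, sorted({a for a, lab in zip(assign, labels) if lab == kind}))
```

Each code vector in `codebook` plays the role of a representative F0 pattern. With real conversational data the number of code vectors and the contour normalization would need to be chosen empirically.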

Yoshinori Sagisaka | Minoru Tsuzaki | Hiroaki Kato | Yoko Greenberg | Nagisa Shibuya