Paper Information - Speaking Styles: Statistical Analysis and Synthesis by a Text-to-Speech System

This chapter considers how text-to-speech (TTS) systems can be enhanced by synthesizing various speaking styles. The first part statistically analyzes speech uttered in three speaking styles. Each speaking style is indicated by the content of the text: a paragraph from a literary novel, advertisement phrases, and a paragraph from an encyclopedia. A professional narrator read the three texts in the speaking styles that he considered appropriate. Characteristics of each speaking style are observed in F0, duration, power, formant frequency, and spectral tilt. Based on the analysis results, we propose a strategy that permits a TTS system to synthesize speech in various speaking styles. Rules are integrated into a conventional TTS system, and listening tests show good performance of the proposed TTS system.

Masanobu Abe
