Speaking Styles: Statistical Analysis and Synthesis by a Text-to-Speech System

This chapter considers enhancing text-to-speech (TTS) systems by synthesizing various speaking styles. The first part statistically analyzes speech uttered in three speaking styles. Each speaking style is indicated by the text content: a paragraph from a literary novel, advertisement phrases, and a paragraph from an encyclopedia. A professional narrator uttered the three texts in the speaking styles that he thought were appropriate. Characteristics of each speaking style are observed in F0, duration, power, formant frequency, and spectral tilt. Based on the analysis results, we propose a strategy that permits a TTS system to synthesize speech in various speaking styles. Rules are integrated into a conventional TTS system, and listening tests show good performance of the proposed TTS system.
