A Japanese text-to-speech system based on multi-form units with consideration of frequency distribution in Japanese

This paper proposes a new text-to-speech (TTS) system that concatenates large numbers of speech segments to produce very natural and intelligible synthetic speech. One novel point of the system is its new synthesis unit, which has three notable characteristics. First, the synthesis units contain all Japanese syllables together with all possible vowel sequences, so very smooth synthetic speech is produced. Second, both preceding and succeeding phoneme environments are considered when speech segments are concatenated, so natural-sounding transitions from a vowel to a consonant, which is the only concatenation point with the proposed unit, are present in the synthetic speech. Third, each unit has various fundamental frequency (F0) contours; F0 modification rates are therefore very small in any synthesis event, and the F0 modification process causes only minor distortion. To develop a unit database efficiently and effectively, we analyzed 4,850,000 Japanese phrases (breath groups) containing 87,810,000 phonemes and ranked the units they contain in order of appearance frequency. The system uses the 50,000 highest-frequency units, which cover over 77% of Japanese texts. Listening tests confirm the high intelligibility and naturalness of speech produced by the new TTS system.
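The frequency ranking and coverage figure described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes that the corpus has already been segmented into synthesis units, that coverage is measured over unit occurrences (tokens), and the function names `rank_units` and `coverage` and the toy unit strings are hypothetical.

```python
from collections import Counter


def rank_units(units):
    """Return (unit, count) pairs sorted by descending appearance frequency."""
    return Counter(units).most_common()


def coverage(ranked, top_n):
    """Fraction of all unit occurrences accounted for by the top_n ranked units.

    Assumption: coverage is counted per occurrence (token), not per distinct unit.
    """
    total = sum(count for _, count in ranked)
    if total == 0:
        return 0.0
    covered = sum(count for _, count in ranked[:top_n])
    return covered / total


# Toy corpus of already-segmented units (hypothetical labels, not real data).
corpus_units = ["ka", "ai", "ka", "to", "ai", "ka", "mu"]
ranked = rank_units(corpus_units)

# Coverage obtained by keeping only the two most frequent units.
top2_coverage = coverage(ranked, 2)
print(ranked[:2], round(top2_coverage, 3))
```

Under these assumptions, choosing a database size amounts to increasing `top_n` until the coverage reaches an acceptable level, which is the kind of trade-off the reported 50,000-unit, 77% figure reflects.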
