Paper information - Unit selection algorithm for Japanese speech synthesis based on both phoneme unit and diphone unit


This paper proposes a novel unit selection algorithm for Japanese Text-To-Speech (TTS) systems. Since Japanese syllables consist of CV (C: consonant, V: vowel) or V, except when a vowel is devoiced, CV units are basic to concatenative TTS systems for Japanese. However, speech synthesized with CV units sometimes has discontinuities due to V-V concatenation. To alleviate such discontinuities, longer units (CV* or non-uniform units) have been proposed, but concatenation between V and V remains unavoidable. To address this problem, we propose a novel unit selection algorithm that incorporates not only phoneme units but also diphone units. Concatenation in the proposed algorithm is performed at the vowel center as well as at the phoneme boundary. Evaluation experiments show that the proposed algorithm outperforms the conventional algorithm.
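The idea of allowing joins both at phoneme boundaries and at vowel centers can be illustrated with a toy dynamic-programming search. The sketch below is not the paper's algorithm or cost function. It is a minimal illustration under several assumptions: each vowel is split into a left and a right half so that a join may fall at its center, spectra are reduced to a single number at each edge of a piece, and the names `Piece`, `slots`, `join_cost`, `select` and the weights `w_center` and `w_boundary` are all hypothetical.

```python
from dataclasses import dataclass

VOWELS = set("aiueo")


@dataclass(frozen=True)
class Piece:
    """A database segment: a whole consonant or one half of a vowel."""
    label: str    # e.g. "k", "a_L" (left half), "a_R" (right half)
    utt: int      # source utterance id (hypothetical)
    pos: int      # index of the piece inside that utterance
    start: float  # toy 1-D spectral feature at the left edge
    end: float    # toy 1-D spectral feature at the right edge


def slots(phonemes):
    """Expand a phoneme sequence so each vowel can be cut at its center."""
    out = []
    for p in phonemes:
        out.extend([p + "_L", p + "_R"] if p in VOWELS else [p])
    return out


def join_cost(a, b, w_center=0.5, w_boundary=1.0):
    """Zero for pieces adjacent in the database, else a weighted mismatch."""
    if a.utt == b.utt and b.pos == a.pos + 1:
        return 0.0
    at_center = a.label.endswith("_L") and b.label.endswith("_R")
    weight = w_center if at_center else w_boundary  # assumed weights
    return weight * abs(a.end - b.start)


def select(phonemes, database):
    """Viterbi search for the piece sequence with minimum total join cost."""
    cands = [[u for u in database if u.label == lab] for lab in slots(phonemes)]
    if any(not c for c in cands):
        raise ValueError("database does not cover the target")
    best = [(0.0, [u]) for u in cands[0]]  # (cost so far, path) per candidate
    for layer in cands[1:]:
        best = [
            min(((c + join_cost(path[-1], u), path + [u]) for c, path in best),
                key=lambda t: t[0])
            for u in layer
        ]
    return min(best, key=lambda t: t[0])


# Toy database: utterance 0 contains "k a", utterance 1 contains "a i".
db = [
    Piece("k", 0, 0, 0.0, 1.0),
    Piece("a_L", 0, 1, 1.0, 2.0),
    Piece("a_R", 0, 2, 2.0, 1.0),  # falls off toward the utterance end
    Piece("a_L", 1, 0, 0.5, 2.1),
    Piece("a_R", 1, 1, 2.1, 3.0),
    Piece("i_L", 1, 2, 3.0, 4.0),
    Piece("i_R", 1, 3, 4.0, 4.0),
]
cost, path = select(["k", "a", "i"], db)
```

In this toy database the cheapest path switches utterances in the middle of /a/, where the two recordings are spectrally close, instead of at the /a/-/i/ boundary, where the first recording is already moving away from the vowel target. A full system would also need target costs and real spectral distances, both of which are left out here.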

Tomoki Toda | Kiyohiro Shikano | Hisashi Kawai | Minoru Tsuzaki
