A new method of generating speech synthesis units based on phonological knowledge and clustering technique

This paper proposes a new method for generating synthesis units from context-dependent phonemes to achieve high-quality text-to-speech (TTS) synthesis. If all phoneme triplets (triphones) in Japanese are considered, the number of synthesis units becomes very large; we therefore introduce two techniques to reduce it. The first technique uses phonological knowledge to reduce the inventory from approximately 15,000 triphones to about 6,000. The second technique is based on segment quantization and reduces the number of units further. Experimental tests show that the proposed method improves articulation and intelligibility scores, that the number of synthesis units can be decreased without significant loss in TTS quality, and that the preference score is proportional to the number of synthesis units.
