Paper information - Automatic synthesis unit generation for English speech synthesis based on multi-layered context oriented clustering

Abstract: In this paper, we propose a new synthesis unit learning method aimed at multilingual speech synthesis and describe its application to English speech synthesis. The method, termed Multi-Layered Context Oriented Clustering (ML-COC), is a generalized framework of the COC method, which has been applied to Japanese speech synthesis. The conventional COC method produces a set of phonetic-context-dependent units through a cluster-splitting process. In ML-COC, the notion of context is generalized, and factors other than phonetic context, such as stress and syntactic boundaries, are taken into account to capture the richer phoneme variations of English. A synthesis unit generation experiment shows that ML-COC produces about three times as many synthesis units as the conventional COC (Single-Layered COC: SL-COC) method, and that the average intra-cluster variance of ML-COC units is 20% lower than that of SL-COC units. These results suggest that the ML-COC synthesis units reflect the phonological structure of English much more appropriately than the SL-COC units do. To validate the effectiveness of the ML-COC method, we conducted a preference test using synthesized speech, in which 10 subjects listened to 52 sentences. The ML-COC method was preferred over the conventional SL-COC method by a score of 70% to 30%.
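The cluster-splitting idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it is a toy greedy splitter, assuming a single scalar acoustic feature per segment and a hypothetical set of context factors (`left`, `right`, `stress`, `boundary`). The data, the function names, and the stopping rule are all made up for illustration. "Single-layered" clustering is modelled here by allowing only the phonetic factors as split questions, and "multi-layered" clustering by allowing every factor.

```python
from statistics import fmean

def sse(segments):
    """Sum of squared deviations of the scalar feature from the cluster mean."""
    if not segments:
        return 0.0
    mean = fmean(s["feature"] for s in segments)
    return sum((s["feature"] - mean) ** 2 for s in segments)

def best_split(cluster, factors, min_size):
    """Return (gain, yes, no) for the context question that most reduces SSE."""
    best = None
    base = sse(cluster)
    for factor in factors:
        for value in sorted({s["context"][factor] for s in cluster}):
            yes = [s for s in cluster if s["context"][factor] == value]
            no = [s for s in cluster if s["context"][factor] != value]
            if len(yes) < min_size or len(no) < min_size:
                continue  # both children must keep enough segments
            gain = base - sse(yes) - sse(no)
            if best is None or gain > best[0]:
                best = (gain, yes, no)
    return best

def cluster_units(segments, factors, min_size=2, min_gain=1e-6):
    """Greedily split clusters until no admissible split reduces SSE."""
    clusters = [list(segments)]
    while True:
        candidates = []
        for i, c in enumerate(clusters):
            split = best_split(c, factors, min_size)
            if split is not None and split[0] > min_gain:
                candidates.append((split[0], i, split[1], split[2]))
        if not candidates:
            return clusters
        _, i, yes, no = max(candidates, key=lambda t: t[0])
        clusters[i:i + 1] = [yes, no]  # replace the parent with its children

def pooled_variance(clusters):
    """Total within-cluster SSE divided by the number of segments."""
    total = sum(len(c) for c in clusters)
    return sum(sse(c) for c in clusters) / total

def seg(feature, left, right, stress, boundary):
    """Build one toy segment (all values are invented)."""
    return {"feature": feature,
            "context": {"left": left, "right": right,
                        "stress": stress, "boundary": boundary}}

# Hypothetical tokens of one phoneme; the feature varies with stress as well
# as with the neighbouring phonemes.
TOY = [
    seg(1.0, "s", "a", "stressed", "none"),
    seg(1.1, "s", "a", "stressed", "none"),
    seg(3.0, "s", "a", "unstressed", "none"),
    seg(3.1, "s", "a", "unstressed", "none"),
    seg(5.0, "n", "i", "stressed", "none"),
    seg(5.1, "n", "i", "stressed", "none"),
    seg(7.0, "n", "i", "unstressed", "phrase"),
    seg(7.1, "n", "i", "unstressed", "phrase"),
]

single_layer = cluster_units(TOY, factors=("left", "right"))
multi_layer = cluster_units(TOY, factors=("left", "right", "stress", "boundary"))

print(len(single_layer), pooled_variance(single_layer))
print(len(multi_layer), pooled_variance(multi_layer))
```

On this toy data the multi-layered run ends with more clusters and a lower pooled variance than the single-layered run, because stress explains variation that the phonetic context alone cannot separate. The numbers say nothing about the paper's reported results, which come from real speech data.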

Shinya Nakajima

[1] Yoshinori Sagisaka, et al. Tree-based unit selection for English speech synthesis, 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2] Hideyuki Mizuno, et al. A new Japanese text-to-speech synthesizer based on COC synthesis method, 1990, ICSLP.

[3] S. Nakajima, et al. Automatic generation of synthesis units based on context oriented clustering, 1988, ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing.

[4] Shigeki Sagayama, et al. Phoneme environment clustering for speech recognition, 1989, International Conference on Acoustics, Speech, and Signal Processing.

[5] S. Roucos, et al. Segment quantization for very-low-rate speech coding, 1982, ICASSP.

[6] Francis Charpentier, et al. Diphone synthesis using an overlap-add technique for speech waveforms concatenation, 1986, ICASSP '86, IEEE International Conference on Acoustics, Speech, and Signal Processing.

[7] O. Fujimura. Syllables as concatenated demisyllables and affixes, 1976.

[8] Masaaki Honda, et al. LPC speech coding based on variable-length segment quantization, 1988, IEEE Trans. Acoust. Speech Signal Process.

[9] Elisabeth Selkirk, et al. Phonology and Syntax: The Relation between Sound and Structure, 1984.