Paper information - Optimizing sub-cost functions for segment selection based on perceptual evaluations in concatenative speech synthesis

Optimizing sub-cost functions for segment selection based on perceptual evaluations in concatenative speech synthesis

In concatenative speech synthesis, various factors affect the naturalness of synthetic speech. The cost for segment selection is calculated by integrating several sub-costs that capture the degradation of naturalness caused by these factors. In this paper, we optimize, based on perceptual evaluations, each sub-cost function that converts a linguistic feature or an acoustic parameter into a sub-cost. Two types of perceptual experiments are performed with test sets constructed by controlling the variation of the sub-costs, in order to evaluate both the independent effect of each sub-cost and the interactions between them. A preference test comparing synthetic speech before and after the optimization demonstrates the effectiveness of perceptually optimizing the sub-cost functions.
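The relationship between sub-cost functions and the integrated selection cost can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the feature names (`f0_diff`, `duration_diff`), the two function shapes, the numeric constants, and the use of a normalized weighted sum as the integration rule are all hypothetical.

```python
from typing import Callable, Dict

# A sub-cost function maps one feature value (here, a target/candidate
# difference) to a non-negative sub-cost.
SubCostFn = Callable[[float], float]


def linear_sub_cost(scale: float) -> SubCostFn:
    """Hypothetical shape: sub-cost grows linearly with the absolute difference."""
    return lambda diff: scale * abs(diff)


def dead_zone_sub_cost(threshold: float, scale: float) -> SubCostFn:
    """Hypothetical shape: differences below `threshold` cost nothing."""
    return lambda diff: scale * max(0.0, abs(diff) - threshold)


def integrated_cost(
    features: Dict[str, float],
    sub_cost_fns: Dict[str, SubCostFn],
    weights: Dict[str, float],
) -> float:
    """Combine sub-costs with a normalized weighted sum (an assumed rule)."""
    total_weight = sum(weights[name] for name in sub_cost_fns)
    weighted = sum(
        weights[name] * fn(features[name]) for name, fn in sub_cost_fns.items()
    )
    return weighted / total_weight


# Illustrative values only; none of these numbers come from the paper.
example_features = {"f0_diff": 30.0, "duration_diff": 0.01}
example_fns = {
    "f0_diff": dead_zone_sub_cost(threshold=10.0, scale=0.05),
    "duration_diff": linear_sub_cost(scale=10.0),
}
example_weights = {"f0_diff": 0.5, "duration_diff": 0.5}
example_cost = integrated_cost(example_features, example_fns, example_weights)
```

In a sketch like this, perceptual optimization would correspond to choosing the shape and parameters of each sub-cost function so that the resulting sub-cost tracks listener judgments; how the paper actually does this is not described in the abstract.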

Tomoki Toda | Hisashi Kawai | Minoru Tsuzaki

[1] Y. Sagisaka, et al. Speech synthesis by rule using an optimal selection of non-uniform synthesis units, 1988, ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing.

[2] Nick Campbell, et al. Improving speech synthesis of CHATR using a perceptual discontinuity function and constraints of prosodic modification, 1998, SSW.

[3] John B. Shoven, et al. I, Edinburgh Medical and Surgical Journal.

[4] Raymond N. J. Veldhuis, et al. Reducing audible spectral discontinuities, 2001, IEEE Trans. Speech Audio Process.

[5] Tomoki Toda, et al. Perceptual evaluation of cost for segment selection in concatenative speech synthesis, 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis.

[6] Tomoki Toda, et al. Optimizing integrated cost function for segment selection in concatenative speech synthesis based on perceptual evaluations, 2003, INTERSPEECH.

[7] M. Ostendorf, et al. A bootstrapping approach to automating prosodic annotation for limited-domain synthesis, 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis.

[8] Y. Sagisaka, et al. Acceptability for temporal modification of single vowel segments in isolated words, 1998, The Journal of the Acoustical Society of America.

[9] Alan W. Black, et al. Limited domain synthesis, 2000, INTERSPEECH.

[10] Yannis Stylianou, et al. Perceptual and objective detection of discontinuities in concatenative speech synthesis, 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[11] Ann K. Syrdal. Phonetic effects on listener detection of vowel concatenation, 2001, INTERSPEECH.

[12] Hisashi Kawai, et al. Acoustic measures vs. phonetic features as predictors of audible discontinuity in concatenative speech synthesis, 2002, INTERSPEECH.