Optimizing sub-cost functions for segment selection based on perceptual evaluations in concatenative speech synthesis

In concatenative speech synthesis, various factors affect the naturalness of synthetic speech. The cost for segment selection is calculated by integrating several sub-costs, each capturing the degradation of naturalness caused by one of these factors. In this paper, we optimize, on the basis of perceptual evaluations, each sub-cost function that converts a linguistic feature or an acoustic parameter into a sub-cost. Two types of perceptual experiments are performed with test sets constructed by controlling the variation of the sub-costs, so as to evaluate both the independent effect of each sub-cost and the interactions between them. We confirm the effectiveness of perceptually optimizing the sub-cost functions with a preference test comparing synthetic speech before and after the optimization.
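To make the relationship between sub-cost functions and the integrated cost concrete, the following is a minimal sketch only. The abstract does not specify how sub-costs are integrated or what form the sub-cost functions take, so the normalized weighted sum, the feature names (`f0`, `duration`), the scaling constants, and all function names here are assumptions for illustration, not the paper's method.

```python
from typing import Callable, Dict


def f0_subcost(log_f0_diff: float) -> float:
    """Hypothetical sub-cost function: maps a log-F0 difference to a sub-cost."""
    return abs(log_f0_diff)


def duration_subcost(dur_diff_ms: float) -> float:
    """Hypothetical sub-cost function: maps a duration difference (ms) to a sub-cost.

    The divisor 100.0 is an arbitrary illustrative scale, not a value from the paper.
    """
    return abs(dur_diff_ms) / 100.0


def integrated_cost(
    features: Dict[str, float],
    subcost_fns: Dict[str, Callable[[float], float]],
    weights: Dict[str, float],
) -> float:
    """Integrate sub-costs as a normalized weighted sum (an assumed integration rule)."""
    total_weight = sum(weights[name] for name in subcost_fns)
    weighted = sum(
        weights[name] * subcost_fns[name](features[name]) for name in subcost_fns
    )
    return weighted / total_weight


# Example: cost of one candidate segment under equal weights.
subcost_fns = {"f0": f0_subcost, "duration": duration_subcost}
weights = {"f0": 0.5, "duration": 0.5}
features = {"f0": -0.2, "duration": 50.0}
cost = integrated_cost(features, subcost_fns, weights)
```

In this framing, perceptually optimizing a sub-cost function would correspond to replacing a mapping such as `f0_subcost` with one whose shape is fitted to listener judgments, while the integration step stays unchanged.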
