Discriminative weight training for unit-selection based speech synthesis

Concatenative speech synthesis that selects units from a large database has become popular because of the high quality of the synthesized speech. The units are selected by minimizing a combination of target and join costs for a given sentence. In this paper, we propose a new approach to training the weight parameters associated with the cost functions used for unit selection in concatenative speech synthesis. We first view unit selection as a classification problem and apply the discriminative training technique, which has been found to be an efficient approach to parameter estimation in speech recognition. Instead of defining an objective function that accounts for subjective speech quality, we take the classification error as the objective function to be optimized. The classification error is approximated by a smooth function, and the relevant parameters are updated by gradient descent.
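The training idea described above can be illustrated with a minimal sketch. The abstract does not give the exact misclassification measure or smoothing function, so the sketch assumes a common minimum-classification-error form: the difference in weighted cost between the reference unit and its cheapest rival, passed through a sigmoid. All function names, the non-negativity clamp on the weights, and the toy sub-cost values are hypothetical and chosen only for illustration.

```python
import math


def total_cost(weights, subcosts):
    """Weighted sum of sub-costs (target and join terms) for one candidate unit."""
    return sum(w * c for w, c in zip(weights, subcosts))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mce_step(weights, correct, competitors, gamma=1.0, lr=0.1):
    """One gradient-descent update on the smoothed classification error.

    correct:     sub-cost vector of the reference (desired) unit
    competitors: sub-cost vectors of the other candidate units
    Returns the updated weights and the loss measured before the update.
    """
    # Assumption: only the strongest rival (lowest weighted cost) is used.
    rival = min(competitors, key=lambda c: total_cost(weights, c))
    # Misclassification measure: positive when the rival is cheaper
    # than the reference unit, i.e. the wrong unit would be selected.
    d = total_cost(weights, correct) - total_cost(weights, rival)
    # Smooth approximation of the 0/1 classification error.
    loss = sigmoid(gamma * d)
    # d(loss)/d(w_k) = gamma * loss * (1 - loss) * (correct_k - rival_k)
    scale = gamma * loss * (1.0 - loss)
    # Assumption: weights are clamped to stay non-negative.
    updated = [
        max(0.0, w - lr * scale * (a - b))
        for w, a, b in zip(weights, correct, rival)
    ]
    return updated, loss


def train(examples, weights, epochs=100, gamma=1.0, lr=0.5):
    """Repeat mce_step over (correct, competitors) pairs for several epochs."""
    loss = float("nan")
    for _ in range(epochs):
        losses = []
        for correct, competitors in examples:
            weights, step_loss = mce_step(weights, correct, competitors, gamma, lr)
            losses.append(step_loss)
        loss = sum(losses) / len(losses)
    return weights, loss


if __name__ == "__main__":
    # Toy data: two sub-costs per unit; the initial weights favor a rival.
    reference = [0.2, 0.9]
    rivals = [[0.8, 0.3], [0.9, 0.9]]
    trained, final_loss = train([(reference, rivals)], [0.2, 1.0])
    print("weights:", trained, "smoothed error:", final_loss)
```

In this toy setup the initial weights make a rival unit cheaper than the reference unit, and the updates shift weight toward the sub-cost on which the reference unit is better. Using only the single cheapest rival keeps the gradient simple; a soft minimum over all rivals would be another reasonable choice.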
