Analytic generation of synthesis units by closed loop training for totally speaker driven text to speech system (TOS drive TTS)

This paper presents a new method for automatically generating speech synthesis units. The algorithm, called Closed-Loop Training (CLT), is based on evaluating and reducing the distortion in synthesized speech. It analytically minimizes the distortion introduced by the synthesis process, such as that caused by prosodic modification. The distortion is measured as the error between synthesized speech units and natural speech units drawn from a large speech database (corpus). The CLT method thus generates the synthesis units that most closely resemble natural speech after the synthesis process has been applied. In this paper, CLT is applied to a waveform-concatenation-based synthesizer whose basic unit is a diphone. With CLT, the synthesizer generates clear and smooth synthetic speech even with a relatively small inventory of synthesis units.
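The closed-loop idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that prosodic modification can be approximated by linear-interpolation resampling to the target length, that distortion is the squared error after applying a least-squares gain, and that units are chosen from a fixed set of candidates. All function names (`resample`, `distortion`, `select_unit`) and the toy data are hypothetical.

```python
def resample(unit, n):
    """Linearly interpolate `unit` to n samples.

    Assumption: this stands in for the prosodic (duration) modification
    that a real synthesizer would apply to a unit.
    """
    if n == 1:
        return [unit[0]]
    scale = (len(unit) - 1) / (n - 1)
    out = []
    for i in range(n):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(unit) - 1)
        frac = pos - lo
        out.append((1 - frac) * unit[lo] + frac * unit[hi])
    return out


def distortion(unit, target):
    """Squared error between the modified unit and a natural target.

    The gain that minimizes the error has a closed form, so this part of
    the minimization is analytic rather than iterative.
    """
    synth = resample(unit, len(target))
    energy = sum(s * s for s in synth)
    gain = sum(s * t for s, t in zip(synth, target)) / energy if energy > 0 else 0.0
    return sum((t - gain * s) ** 2 for s, t in zip(synth, target))


def select_unit(candidates, corpus):
    """Return the index of the candidate with the lowest total distortion
    over all natural segments in `corpus`, plus the per-candidate totals."""
    totals = [sum(distortion(c, t) for t in corpus) for c in candidates]
    best = min(range(len(candidates)), key=totals.__getitem__)
    return best, totals


# Toy data (hypothetical): two natural segments of different durations
# and two candidate units.
corpus = [[0.0, 1.0, 0.0], [0.0, 0.5, 1.0, 0.5, 0.0]]
candidates = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
best, totals = select_unit(candidates, corpus)
```

The point of the sketch is that each candidate is scored after the modification step has been applied, so the chosen unit is the one that stays closest to natural speech once synthesized, rather than the one that looks best before synthesis.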
