Iterative unit selection with unnatural prosody detection

Corpus-driven speech synthesis is hampered by occasional glitches that ruin the impression of the whole utterance. We propose an iterative unit selection method integrated with an unnatural prosody detection model. The system searches for an optimal path in the candidate lattice, verifies its naturalness with the unnatural prosody model, and replaces any bad section with a better candidate, repeating until the path passes the verification test. Framing the procedure as hypothesis testing, we show that this trial-and-error approach takes effective advantage of the abundant candidate samples in the database. Also, in contrast to conventional prosody prediction, an unnatural prosody detection model still leaves enough room for prosody variation. Unnaturalness confidence measures are studied. The combined model can reduce the objective distortion by 16.3%. Perceptual experiments also confirm that the proposed approach appreciably improves synthetic speech quality.
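
The search, verify, and replace loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: candidates are plain F0 values, the cost functions are toy stand-ins, and `find_jump` is a hypothetical placeholder for the trained unnatural prosody detection model. The names `viterbi` and `iterative_select` are likewise chosen here for illustration only.

```python
from typing import Callable, List, Optional


def viterbi(lattice, target_cost, concat_cost):
    """Return the minimum-cost path, one candidate per lattice position."""
    # best[i][j] = (cost of best path ending at candidate j of position i, backpointer)
    best = [[(target_cost(0, c), -1) for c in lattice[0]]]
    for i in range(1, len(lattice)):
        row = []
        for c in lattice[i]:
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(p, c), k)
                for k, p in enumerate(lattice[i - 1])
            )
            row.append((cost + target_cost(i, c), back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(lattice) - 1, -1, -1):
        path.append(lattice[i][j])
        j = best[i][j][1]
    return path[::-1]


def iterative_select(
    lattice: List[List[float]],
    target_cost: Callable[[int, float], float],
    concat_cost: Callable[[float, float], float],
    find_unnatural: Callable[[List[float]], Optional[int]],
    max_iters: int = 10,
) -> List[float]:
    """Search, verify, and prune the flagged candidate until the path passes."""
    lattice = [list(col) for col in lattice]  # work on a copy
    path = viterbi(lattice, target_cost, concat_cost)
    for _ in range(max_iters):
        bad = find_unnatural(path)
        if bad is None or len(lattice[bad]) == 1:
            return path  # verified, or no alternative candidate is left
        lattice[bad].remove(path[bad])  # drop the rejected unit
        path = viterbi(lattice, target_cost, concat_cost)
    return path  # iteration budget exhausted; last path is not re-verified


# Toy data (assumed): F0 candidates in Hz for three unit positions.
lattice = [[100.0, 120.0], [180.0, 110.0], [105.0, 125.0]]
targets = [100.0, 180.0, 105.0]


def find_jump(path: List[float]) -> Optional[int]:
    """Hypothetical detector: flag the first unit that jumps more than 50 Hz."""
    for i in range(1, len(path)):
        if abs(path[i] - path[i - 1]) > 50.0:
            return i
    return None


result = iterative_select(
    lattice,
    target_cost=lambda i, c: abs(c - targets[i]),
    concat_cost=lambda a, b: 0.0,
    find_unnatural=find_jump,
)
```

In this toy run the first search picks the 180 Hz candidate because it has the lowest target cost, the detector rejects it, and the second search settles on the 110 Hz alternative. Pruning a single candidate per iteration is one simple way to realize the "replace the bad section" step; the paper may well use a different replacement strategy.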
