Rich-context Unit Selection (RUS) approach to high-quality TTS

This paper presents a Rich-context Unit Selection (RUS) approach to high-quality speech synthesis. Building on our previous work on rich context modeling, we use the corresponding parametric HMMs to represent waveform units and arrange them in a “sausage-like” lattice. We propose a prune-and-search procedure: Kullback-Leibler divergence is used to select potential candidate units, and normalized cross-correlation serves as the final objective measure in the search for the optimal unit path. The maximum cross-correlation criterion yields the optimal concatenation between successive units in terms of spectral similarity, phase continuity, and the best time instants for joining them. For subjective evaluation, both preference and mean opinion score (MOS) tests were conducted to compare RUS with our current Weight-table based Unit Selection (WUS) synthesis. Experimental results show that RUS significantly improves the voice quality of synthesized speech over the conventional WUS.
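The prune-and-search idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. It makes several simplifying assumptions: each unit is summarized by a single diagonal-covariance Gaussian rather than a full HMM, pruning keeps a fixed number of candidates, and the join point is chosen by scanning a small range of lags. The function names (`kl_diag_gauss`, `prune_candidates`, `best_join_lag`) and the parameters `keep`, `window`, and `max_lag` are hypothetical.

```python
import numpy as np


def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians (assumed unit model)."""
    mu_p, var_p, mu_q, var_q = (
        np.asarray(x, dtype=float) for x in (mu_p, var_p, mu_q, var_q)
    )
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    )


def prune_candidates(target, candidates, keep):
    """Return indices of the `keep` candidates with the smallest KL divergence
    from the target model. Each model is a (mean, variance) pair."""
    scores = [kl_diag_gauss(target[0], target[1], mu, var) for mu, var in candidates]
    order = np.argsort(scores, kind="stable")
    return [int(i) for i in order[:keep]]


def best_join_lag(tail, head, window, max_lag):
    """Scan lags into `head` and return (lag, score) maximizing the normalized
    cross-correlation with the last `window` samples of `tail`."""
    a = np.asarray(tail, dtype=float)[-window:]
    head = np.asarray(head, dtype=float)
    best_lag, best_score = 0, -np.inf
    for lag in range(max_lag + 1):
        b = head[lag:lag + window]
        if len(b) < window:
            break  # not enough samples left in the next unit
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        score = float(a @ b / denom) if denom > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


if __name__ == "__main__":
    # Toy example: two excerpts of the same sinusoid with mismatched phase.
    n = np.arange(200)
    x = np.sin(2 * np.pi * n / 20)
    lag, score = best_join_lag(x[:100], x[65:165], window=40, max_lag=19)
    print(lag, round(score, 3))
```

In this toy case the selected lag is the one that realigns the phase of the two excerpts, which is the sense in which a maximum cross-correlation criterion favors phase-continuous joins. A full system would score every surviving candidate pair in the lattice and search for the best overall path.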
