Probability based prosody model for unit selection

Most modern text-to-speech (TTS) systems are based on unit selection. In such systems, the predicted prosody values for each synthesis unit, such as pitch, duration, and energy, are important factors in selecting units. We present a probability-based prosody model in which the distribution of prosody values in a given context-equivalent cluster is described by a Gaussian mixture model (GMM), and the distance between a candidate unit and the context-equivalent cluster is defined by the GMM probability output. A novel framework for unit-selection TTS systems is derived from the model, and a series of experiments is conducted on the framework.
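To make the idea of a GMM-based distance concrete, the following is a minimal sketch, not the paper's implementation. It assumes a diagonal-covariance GMM over a three-dimensional prosody vector (pitch, duration, energy) and assumes the distance is the negative log of the GMM density; the abstract only states that the distance is defined by the GMM probability output. All function names and parameter values here are hypothetical and chosen for illustration.

```python
import numpy as np


def gmm_log_likelihood(x, weights, means, variances):
    """Log density of vector x under a diagonal-covariance GMM."""
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Per-component log N(x | mean_k, diag(variance_k)).
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_terms = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over components for numerical stability.
    peak = np.max(log_terms)
    return float(peak + np.log(np.sum(np.exp(log_terms - peak))))


def prosody_distance(x, weights, means, variances):
    """Assumed distance: negative log-likelihood under the cluster's GMM."""
    return -gmm_log_likelihood(x, weights, means, variances)


def rank_candidates(candidates, weights, means, variances):
    """Return candidate indices ordered from smallest to largest distance."""
    costs = [prosody_distance(c, weights, means, variances) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: costs[i])


# Hypothetical GMM for one context-equivalent cluster.
# Feature order (assumed): pitch in Hz, duration in ms, energy in dB.
weights = [0.6, 0.4]
means = [[120.0, 80.0, 60.0], [180.0, 110.0, 65.0]]
variances = [[100.0, 225.0, 9.0], [144.0, 400.0, 16.0]]

# Hypothetical candidate units for the same target context.
candidates = [
    [122.0, 85.0, 61.0],   # close to the first mixture component
    [250.0, 40.0, 50.0],   # far from both components
    [178.0, 100.0, 66.0],  # close to the second mixture component
]

ranking = rank_candidates(candidates, weights, means, variances)
print(ranking)
```

Because the cost comes from a mixture rather than a single predicted value, a candidate near any mode of the cluster's distribution is scored as a good match. In a complete unit selection system this cost would typically be combined with a concatenation cost during the search, which is outside the scope of this sketch.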
