Speaker specific phrase break modeling with conditional random fields for text-to-speech

In this paper we present a new cascading conditional random field-based phrase break model for text-to-speech systems, trained on the same speaker-specific acoustic data used to build the text-to-speech voices. The training phase does not require any manually labeled phrase break tags, as these are derived directly from the speaker-specific recordings used for building the synthetic voices. We present objective evaluations on various corpora and show that the proposed model compares well with state-of-the-art data-driven phrase break models, with the added benefit of operating within a unified framework.
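As an illustration of how phrase break tags might be derived from recordings without manual labeling, the following minimal sketch marks a break wherever the silence between two consecutive aligned words exceeds a threshold. The abstract does not specify the derivation procedure, so the alignment format, the function name `label_breaks`, the tag names `B`/`NB`, and the 0.15 s threshold are all assumptions made for this example, not the authors' method.

```python
from typing import List, Tuple

# Hypothetical alignment entry: (token, start time in s, end time in s),
# e.g. as produced by a forced aligner run on the voice recordings.
Word = Tuple[str, float, float]


def label_breaks(words: List[Word], min_pause: float = 0.15) -> List[Tuple[str, str]]:
    """Tag each word with "B" (break follows) or "NB" (no break follows).

    min_pause is an assumed silence threshold in seconds, not a value
    taken from the paper.
    """
    labels = []
    for i, (token, _start, end) in enumerate(words):
        if i + 1 < len(words):
            # Silence between the end of this word and the start of the next.
            gap = words[i + 1][1] - end
            tag = "B" if gap >= min_pause else "NB"
        else:
            # Assumption: the end of the utterance always counts as a break.
            tag = "B"
        labels.append((token, tag))
    return labels


alignment = [
    ("in", 0.00, 0.12),
    ("this", 0.12, 0.30),
    ("paper", 0.30, 0.70),
    ("we", 1.00, 1.10),
    ("present", 1.10, 1.60),
]
tagged = label_breaks(alignment)
```

Label sequences of this kind could then serve as training targets for a sequence labeler such as a conditional random field, with features drawn from the text of each word and its neighbors.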
