Learning continuous representation of text for phone duration modeling in statistical parametric speech synthesis

In this paper, we investigate a continuous-representation approach in which the feature vector derived from the input text is used to predict phone durations in a text-to-speech (TTS) system. We pose duration prediction as a data-driven statistical transformation from the input text onto the feature space. We first present a method for mapping both the categorical and the numeric features that are typically used into a continuous numeric representation, and then model it as a form of matrix factorization to improve the representation. The proposed system is evaluated with root mean squared error (RMSE) as the objective measure and mean opinion score (MOS) as the subjective measure. We find that the system performs on par with state-of-the-art duration modeling systems, both subjectively and objectively.
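
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy features (phone identity, stress, position), the use of one-hot encoding, truncated SVD as the matrix factorization, a least-squares regressor, and all function names are assumptions made purely for illustration. It shows categorical and numeric features being mapped into one continuous matrix, a low-rank factorization of that matrix, and RMSE as the objective measure.

```python
import numpy as np

def one_hot(column):
    """Map a list of categorical labels to a one-hot matrix (illustrative choice)."""
    labels = sorted(set(column))
    index = {label: i for i, label in enumerate(labels)}
    out = np.zeros((len(column), len(labels)))
    for row, label in enumerate(column):
        out[row, index[label]] = 1.0
    return out

def standardize(values):
    """Scale a numeric feature to zero mean and unit variance, as a column vector."""
    values = np.asarray(values, dtype=float).reshape(-1, 1)
    std = values.std()
    return (values - values.mean()) / (std if std > 0 else 1.0)

def factorize(X, rank):
    """Truncated SVD, X ~ (U_k S_k) V_k^T; returns the row embeddings U_k S_k."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :rank] * S[:rank]

def rmse(predicted, reference):
    """Root mean squared error between predicted and reference durations."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean((predicted - reference) ** 2)))

# Hypothetical toy data: one row per phone (all values invented for this sketch).
phones = ["a", "t", "a", "s", "t", "a"]   # categorical feature
stress = [1, 0, 0, 1, 0, 1]               # numeric feature
position = [0, 1, 2, 0, 1, 2]             # numeric feature
durations_ms = np.array([110.0, 60.0, 90.0, 95.0, 55.0, 120.0])

# Step 1: map categorical and numeric features into one continuous matrix.
X = np.hstack([one_hot(phones), standardize(stress), standardize(position)])

# Step 2: obtain a low-rank continuous representation via matrix factorization.
Z = factorize(X, rank=3)

# Step 3: fit a simple least-squares duration predictor on the representation.
A = np.hstack([Z, np.ones((len(Z), 1))])  # append a bias column
weights, *_ = np.linalg.lstsq(A, durations_ms, rcond=None)
predicted = A @ weights

print("training RMSE (ms):", rmse(predicted, durations_ms))
```

A real system would fit on a training set and report RMSE on held-out phones; the single toy set here only keeps the sketch short.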
