Improving phone duration modelling using support vector regression fusion

In this work, we propose a scheme for fusing different phone duration models that operate in parallel. Specifically, the predictions of a group of dissimilar, mutually independent duration models are fed to a machine learning algorithm, which reconciles and fuses the outputs of the individual models, yielding more precise phone duration predictions. The performance of the individual duration models and of the proposed fusion scheme is evaluated on the American-English KED TIMIT and the Greek WCL-1 databases. On both databases, the individual model based on support vector regression (SVR) achieves the lowest error. Compared with the second-best individual algorithm, it reduces the mean absolute error (MAE) and the root mean square error (RMSE) by a relative 5.5% and 3.7% on KED TIMIT, and by 6.8% and 3.7% on WCL-1. At the fusion stage, we evaluate 12 fusion techniques. When implemented with SVR-based fusion, the proposed scheme improves phone duration prediction accuracy over that of the best individual model, with relative reductions of the MAE and RMSE of 1.9% and 2.0% on KED TIMIT, and of 2.6% and 1.8% on WCL-1.
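The results above are stated as relative reductions of the MAE and RMSE. The following minimal sketch shows how such figures can be computed. The function names and the toy durations (in milliseconds) are illustrative assumptions and are not taken from the paper or its databases.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between reference and predicted durations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted durations."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def relative_reduction(baseline_error, new_error):
    """Relative error reduction, in percent, with respect to a baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical phone durations in ms (toy values, not from KED TIMIT or WCL-1).
reference = [80.0, 120.0, 60.0, 100.0]
baseline_pred = [90.0, 110.0, 70.0, 90.0]   # stands in for a baseline model
improved_pred = [85.0, 112.0, 68.0, 93.0]   # stands in for a better model

for name, metric in (("MAE", mae), ("RMSE", rmse)):
    base = metric(reference, baseline_pred)
    new = metric(reference, improved_pred)
    print(f"{name}: {base:.2f} -> {new:.2f} ms, "
          f"relative reduction {relative_reduction(base, new):.1f}%")
```

Because the RMSE penalises large deviations more heavily than the MAE, the two relative reductions generally differ for the same pair of models, which is consistent with the abstract reporting them separately.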
