Weighted neural network ensemble models for speech prosody control

In text-to-speech synthesis systems, the quality of the predicted prosody contours influences the quality and naturalness of the synthetic speech. This paper presents a new statistical model for prosody control that combines an ensemble learning technique using neural networks as base learners with feature relevance determination. This weighted neural network ensemble model was applied to both phone duration modeling and fundamental frequency modeling. A comparison with state-of-the-art prosody models based on classification and regression trees (CART), multivariate adaptive regression splines (MARS), and artificial neural networks (ANN) shows a 12% improvement over the best duration model and a 24% improvement over the best F0 model. The neural network ensemble model also outperforms a recently presented ensemble model based on gradient tree boosting.

Index Terms: speech synthesis, prosody control, neural networks, ensemble models
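The abstract does not spell out how the ensemble members are trained or weighted. As a hedged illustration of the general technique, the sketch below trains several small single-hidden-layer networks on bootstrap samples and weights each member by its inverse out-of-bag error. This is one common aggregation strategy for weighted neural network ensembles, not necessarily the scheme used in the paper; the toy 1-D regression target stands in for a real duration or F0 prediction task.

```python
import numpy as np

def train_mlp(X, y, hidden=8, lr=0.05, epochs=300, rng=None):
    """Train a tiny one-hidden-layer tanh MLP regressor by batch gradient descent."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    W1 = rng.normal(0, 0.5, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)               # hidden activations
        pred = (h @ W2 + b2).ravel()
        err = pred - y                          # gradient of 0.5 * MSE w.r.t. pred
        gW2 = h.T @ err[:, None] / n
        gb2 = err.mean(keepdims=True)
        dh = (err[:, None] @ W2.T) * (1 - h ** 2)   # backprop through tanh
        gW1 = X.T @ dh / n; gb1 = dh.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2
    return lambda Xq: (np.tanh(Xq @ W1 + b1) @ W2 + b2).ravel()

def weighted_ensemble(X, y, n_models=5, seed=0):
    """Bootstrap-train several MLPs; weight each by its inverse out-of-bag MSE."""
    rng = np.random.default_rng(seed)
    models, weights = [], []
    n = len(X)
    for k in range(n_models):
        idx = rng.integers(0, n, n)             # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)   # out-of-bag rows for validation
        f = train_mlp(X[idx], y[idx], rng=np.random.default_rng(seed + k))
        mse = np.mean((f(X[oob]) - y[oob]) ** 2) if len(oob) else 1.0
        models.append(f)
        weights.append(1.0 / (mse + 1e-8))      # better members get larger weight
    w = np.array(weights); w /= w.sum()         # normalize to a convex combination
    return lambda Xq: sum(wk * f(Xq) for wk, f in zip(w, models))

# Toy demo: noisy 1-D regression as a stand-in for prosody (duration/F0) prediction.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (200, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=200)
ens = weighted_ensemble(X, y)
print("ensemble training MSE:", np.mean((ens(X) - y) ** 2))
```

Weighting by inverse validation error, as done here, lets accurate members dominate the combined prediction while still averaging out the variance of individual networks; plain bagging is the special case of uniform weights.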
