Phoneme and Sentence-Level Ensembles for Speech Recognition

We address the question of whether and how boosting and bagging can be used for speech recognition. To do so, we compare two boosting schemes, one at the phoneme level and one at the utterance level, with a phoneme-level bagging scheme. We control for many parameters and other design choices, such as the state inference scheme. In an unbiased experiment, we show that the gain of the boosting methods over a single hidden Markov model is only marginal in all cases, while bagging significantly outperforms all other methods. We therefore conclude that bagging methods, which have so far been overlooked in favour of boosting, should be examined more closely as a potentially useful ensemble learning technique for speech recognition.
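As a rough illustration of the bagging idea only, the sketch below trains several models on bootstrap replicates of a labelled data set and combines them by majority vote. It is not the system evaluated here: a toy nearest-centroid classifier on one-dimensional features stands in for the hidden Markov models, and all names (`train_centroids`, `bagged_models`, `bagged_predict`) and the data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_centroids(samples):
    """Toy stand-in for an acoustic model: mean feature value per phoneme label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for feature, label in samples:
        sums[label] += feature
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model, feature):
    """Return the label whose centroid is closest to the feature value."""
    return min(model, key=lambda label: abs(model[label] - feature))


def bagged_models(samples, n_models, rng):
    """Train one model per bootstrap replicate (sampling with replacement)."""
    models = []
    for _ in range(n_models):
        replicate = [rng.choice(samples) for _ in samples]
        models.append(train_centroids(replicate))
    return models


def bagged_predict(models, feature):
    """Combine the ensemble by unweighted majority vote."""
    votes = Counter(predict(model, feature) for model in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical (feature, phoneme) pairs, purely for illustration.
    data = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.9, "b"), (1.0, "b"), (1.1, "b")]
    ensemble = bagged_models(data, n_models=5, rng=random.Random(0))
    print(bagged_predict(ensemble, 0.15))
```

Unlike boosting, each replicate here is drawn independently and every model receives an equal vote, so no model depends on the errors of the ones trained before it.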
