Boosting HMM acoustic models in large vocabulary speech recognition

Abstract: Boosting algorithms have been successfully used to improve performance in a variety of classification tasks. Here, we suggest an approach for applying a popular boosting algorithm, AdaBoost.M2, to Hidden Markov Model-based speech recognizers at the utterance level. In a variety of recognition tasks we show that boosting significantly improves on the best test error rates obtained with standard maximum likelihood training. In addition, results from several isolated word decoding experiments show that boosting may provide further performance gains over discriminative training when the two training techniques are combined. In our experiments this also holds when comparing final classifiers with a similar number of parameters, and when evaluating in decoding conditions whose lexicon and acoustics are mismatched with the training conditions. Moreover, we present an extension of our algorithm to large vocabulary continuous speech recognition that allows online recognition without further processing of N-best lists or word lattices. This is achieved by a lexical approach to combining different acoustic models in decoding: we introduce a weighted summation over an extended set of alternative pronunciation models representing both the boosted models and the baseline model. In this way, arbitrarily long utterances can be recognized by the boosted ensemble in a single-pass decoding framework. Evaluation results are presented on two tasks: a real-life spontaneous speech dictation task with a 60k-word vocabulary, and Switchboard.
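For readers unfamiliar with AdaBoost.M2, the following is a minimal sketch of one boosting round and of the final weighted vote, following the standard Freund and Schapire formulation rather than the specific implementation used in this work. The function names `adaboost_m2_round` and `combine` are hypothetical, and treating each model's scores as normalized confidences in [0, 1] over a fixed list of candidate labels per utterance (for example, derived from an N-best list) is an assumption made for illustration.

```python
import math


def adaboost_m2_round(weights, scores, labels):
    """One round of standard AdaBoost.M2 (illustrative sketch).

    weights[i][y]: mislabel distribution D_t(i, y); zero where y == labels[i],
                   and summing to 1 over all (i, y) pairs.
    scores[i][y]:  confidence h_t(x_i, y) in [0, 1] that the current model
                   assigns label y to utterance i (assumed representation).
    labels[i]:     index of the correct label for utterance i.
    Returns (pseudo_loss, beta, updated_weights).
    """
    # Pseudo-loss: 1/2 * sum over wrong labels of D * (1 - h(correct) + h(wrong)).
    eps = 0.0
    for i, yi in enumerate(labels):
        for y, d in enumerate(weights[i]):
            if y != yi:
                eps += 0.5 * d * (1.0 - scores[i][yi] + scores[i][y])
    if not 0.0 < eps < 1.0:
        raise ValueError("pseudo-loss must lie strictly between 0 and 1")

    beta = eps / (1.0 - eps)

    # Reweight: pairs the model already separates well are scaled down more.
    updated = []
    for i, yi in enumerate(labels):
        row = []
        for y, d in enumerate(weights[i]):
            if y == yi:
                row.append(0.0)
            else:
                exponent = 0.5 * (1.0 + scores[i][yi] - scores[i][y])
                row.append(d * beta ** exponent)
        updated.append(row)

    # Renormalize so the weights form a distribution again.
    z = sum(sum(row) for row in updated)
    updated = [[v / z for v in row] for row in updated]
    return eps, beta, updated


def combine(scores_per_model, betas):
    """Final AdaBoost.M2 decision for one utterance.

    scores_per_model[t][y]: confidence of model t for label y.
    Returns the label index maximizing sum_t log(1 / beta_t) * h_t(x, y).
    """
    num_labels = len(scores_per_model[0])
    totals = [0.0] * num_labels
    for model_scores, beta in zip(scores_per_model, betas):
        vote = math.log(1.0 / beta)
        for y in range(num_labels):
            totals[y] += vote * model_scores[y]
    return max(range(num_labels), key=lambda y: totals[y])
```

In this sketch, an utterance whose correct label receives a lower score than a competing label keeps relatively more weight after the update, so the next model trained on the reweighted data concentrates on it.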
