Directed decision trees for generating complementary systems

Many large vocabulary continuous speech recognition systems combine the outputs of multiple systems to obtain the final hypothesis. These complementary systems are typically found in an ad hoc manner, by testing combinations of diverse systems and selecting the best. This paper presents a new algorithm for generating complementary systems by altering decision tree generation, and a divergence measure for comparing decision trees. The decision tree is biased against clustering states that have previously led to confusions. This yields a system that concentrates states in contexts that were previously confusable, so the resulting systems tend to make different errors. Results are presented on two broadcast news tasks, Mandarin and Arabic. The results show that combining multiple systems built from directed decision trees gives gains in performance when confusion network combination is used as the combination method. They also show that the gains achieved with the directed tree algorithm are additive to the gains achieved with other techniques that have been shown empirically to be complementary.
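
The abstract does not give the exact form of the bias, so the following is only an illustrative sketch of one way a tree-building criterion could be biased against clustering previously confused states. It assumes the standard single-Gaussian log-likelihood gain used in phonetic decision-tree state clustering, restricted to one dimension, and it assumes that the bias is applied by scaling the occupancy of confusable states. The names `cluster_loglik`, `split_gain`, and `weights`, and all numbers, are hypothetical and do not come from the paper.

```python
import math


def cluster_loglik(stats):
    """Log-likelihood of pooling states under one 1-D Gaussian.

    stats: list of (occupancy, mean, variance) tuples, one per state.
    """
    n = sum(occ for occ, _, _ in stats)
    mean = sum(occ * m for occ, m, _ in stats) / n
    # Pooled variance from the pooled second moment.
    second = sum(occ * (v + m * m) for occ, m, v in stats) / n
    var = second - mean * mean
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def split_gain(states, left, right, weights=None):
    """Likelihood gain from keeping `left` and `right` in separate clusters.

    states:  dict mapping state name -> (occupancy, mean, variance)
    weights: optional dict mapping state name -> occupancy multiplier.
             ASSUMPTION: states involved in earlier confusions get a
             multiplier above 1, so separating them is worth more.
    """
    w = weights or {}

    def scaled(names):
        return [(states[s][0] * w.get(s, 1.0), states[s][1], states[s][2])
                for s in names]

    l, r = scaled(left), scaled(right)
    return cluster_loglik(l) + cluster_loglik(r) - cluster_loglik(l + r)


# Hypothetical sufficient statistics for two context-dependent states.
states = {"a": (10.0, 0.0, 1.0), "b": (10.0, 1.0, 1.0)}

plain = split_gain(states, ["a"], ["b"])
directed = split_gain(states, ["a"], ["b"], weights={"a": 2.0, "b": 2.0})
print(plain, directed)
```

In this sketch, upweighting both states doubles the gain from separating them, so a greedy tree builder would be more likely to choose a question that splits them rather than leave them tied in one leaf.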
