Dynamic Combination of Automatic Speech Recognition Systems by Driven Decoding

Combining automatic speech recognition (ASR) systems generally relies on the posterior merging of their outputs or on acoustic cross-adaptation. In this paper, we propose an integrated approach in which the outputs of secondary systems are integrated into the search algorithm of a primary one. In this driven decoding algorithm (DDA), the secondary systems are viewed as observation sources that are evaluated and combined with the others by the primary search algorithm. DDA is evaluated on a subset of the ESTER I corpus consisting of 4 hours of French radio broadcast news. Results demonstrate that DDA significantly outperforms vote-based approaches: we obtain a 14.5% relative reduction in word error rate over the best single system, as opposed to 6.7% with a ROVER combination. An in-depth analysis of DDA shows its ability to improve robustness (gains are greater in adverse conditions) and a relatively low dependency on the search algorithm: applying DDA to both an A*-based and a beam-search-based decoder yields similar performance.
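To make the reported relative gains concrete, the short Python sketch below converts a relative word error rate (WER) reduction into an absolute WER. The 20.0% baseline and both function names are hypothetical, chosen only for illustration; the abstract gives only the relative figures (14.5% for DDA, 6.7% for ROVER), not the absolute error rates.

```python
def relative_wer_reduction(baseline_wer: float, combined_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - combined_wer) / baseline_wer


def wer_after_relative_reduction(baseline_wer: float, relative_reduction: float) -> float:
    """Return the absolute WER obtained after a given relative reduction."""
    return baseline_wer * (1.0 - relative_reduction)


# Hypothetical baseline WER in percent (an assumption, not a figure from the paper).
baseline = 20.0

# Relative reductions reported in the abstract.
dda_wer = wer_after_relative_reduction(baseline, 0.145)
rover_wer = wer_after_relative_reduction(baseline, 0.067)

print(f"DDA:   {dda_wer:.2f}% WER")
print(f"ROVER: {rover_wer:.2f}% WER")
```

Under this assumed baseline, the two relative reductions correspond to absolute gains of 2.9 and 1.34 WER points respectively; with a different baseline the absolute figures scale proportionally.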
