Paper information - Dynamic Combination of Automatic Speech Recognition Systems by Driven Decoding

Combining automatic speech recognition (ASR) systems generally relies on the posterior merging of their outputs or on acoustic cross-adaptation. In this paper, we propose an integrated approach in which the outputs of secondary systems are integrated into the search algorithm of a primary one. In this driven decoding algorithm (DDA), the secondary systems are viewed as observation sources that are evaluated and combined with the others by the primary search algorithm. DDA is evaluated on a subset of the ESTER I corpus consisting of 4 hours of French radio broadcast news. Results demonstrate that DDA significantly outperforms vote-based approaches: we obtain a relative word error rate reduction of 14.5% over the best single system, as opposed to 6.7% with a ROVER combination. An in-depth analysis of DDA shows its ability to improve robustness (gains are greater in adverse conditions) and a relatively low dependency on the search algorithm. Applying DDA to both A* and beam-search-based decoders yields similar performance.
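The gains quoted in the abstract are relative, not absolute, word error rate (WER) reductions. The short sketch below shows how such a figure is usually computed. The function name and the absolute WER values are hypothetical and chosen only for illustration; the abstract does not report absolute WERs.

```python
def relative_wer_reduction(baseline_wer: float, combined_wer: float) -> float:
    """Return the relative WER reduction, in percent, of a combined
    system over a baseline system. Both inputs are absolute WERs in percent."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - combined_wer) / baseline_wer


# Hypothetical absolute WERs (NOT taken from the paper).
baseline = 25.0   # best single system
combined = 22.0   # system combination
print(relative_wer_reduction(baseline, combined))
```

With these made-up values, an absolute drop of 3 WER points corresponds to a 12% relative reduction, which is why relative and absolute figures should not be compared directly.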

Georges Linarès | Benjamin Lecouteux | Guillaume Gravier | Yannick Estève
