Paper Information - System Combination by Driven Decoding

System Combination by Driven Decoding

The combination of automatic speech recognition (ASR) systems generally relies on an a posteriori merging of system outputs or on cross-adaptation. In this paper, we propose an integrated approach in which the search of a primary system is driven by the outputs of a secondary one. The method drives the primary system's search using the one-best hypotheses and the word posteriors gathered from the secondary system. Experiments are carried out within the framework of the ESTER evaluation campaign (S. Galliano et al. 2005). Results show that the driven decoding algorithm significantly outperforms both single ASR systems (a relative WER reduction of 8%, 1.7% absolute). Finally, we investigate the interactions between driven decoding and cross-adaptation. The best cross-adaptation strategy, combined with the driven decoding process, yields a final absolute gain of about 1.9% WER.

Georges Linarès | Benjamin Lecouteux | Julie Mauclair | Yannick Estève

[1] Richard M. Schwartz, et al. The 2004 BBN/LIMSI 20xRT English conversational telephone speech recognition system, 2005, INTERSPEECH.

[2] Paul Deléglise, et al. Automatic Detection of Well Recognized Words in Automatic Speech Transcriptions, 2006, LREC.

[3] Andreas Stolcke, et al. Finding consensus in speech recognition: word error minimization and other applications of confusion networks, 2000, Comput. Speech Lang.

[4] Guillaume Gravier, et al. The ESTER phase II evaluation campaign for the rich transcription of French broadcast news, 2005, INTERSPEECH.

[5] Georges Linarès, et al. Imperfect transcript driven speech recognition, 2006, INTERSPEECH.

[6] I-Fan Chen, et al. A new framework for system combination based on integrated hypothesis space, 2006, INTERSPEECH.

[7] Richard M. Stern, et al. The 1997 CMU Sphinx-3 English Broadcast News Transcription System, 1997.

[8] Jean-François Bonastre, et al. ALIZE, a free toolkit for speaker recognition, 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.

[9] Paul Deléglise, et al. The LIUM speech transcription system: a CMU Sphinx III-based system for French broadcast news, 2005, INTERSPEECH.

[10] Jonathan G. Fiscus, et al. A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER), 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.