Why word error rate is not a good metric for speech recognizer training for the speech translation task?

Speech translation (ST) is an enabling technology for cross-lingual oral communication. A ST system consists of two major components: an automatic speech recognizer (ASR) and a machine translator (MT). Nowadays, most ASR systems are trained and tuned by minimizing word error rate (WER). However, WER counts word errors at the surface level. It does not consider the contextual and syntactic roles of a word, which are often critical for MT. In the end-to-end ST scenarios, whether WER is a good metric for the ASR component of the full ST system is an open issue and lacks systematic studies. In this paper, we report our recent investigation on this issue, focusing on the interactions of ASR and MT in a ST system. We show that BLEU-oriented global optimization of ASR system parameters improves the translation quality by an absolute 1.5% BLEU score, while sacrificing WER over the conventional, WER-optimized ASR system. We also conducted an in-depth study on the impact of ASR errors on the final ST output. Our findings suggest that the speech recognizer component of the full ST system should be optimized by translation metrics instead of the traditional WER.

[1]  D. Anderson,et al.  Algorithms for minimization without derivatives , 1974 .

[2]  Gökhan Tür,et al.  Automatic disfluency removal for improving spoken language translation , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[3]  Wu Chou,et al.  Discriminative learning in sequential pattern recognition , 2008, IEEE Signal Processing Magazine.

[4]  Tanja Schultz,et al.  Using word latice information for a tighter coupling in speech translation systems , 2004, INTERSPEECH.

[5]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.

[6]  Bowen Zhou,et al.  On Efficient Coupling of ASR and SMT for Speech Translation , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[7]  Ralph Weischedel,et al.  A STUDY OF TRANSLATION ERROR RATE WITH TARGETED HUMAN ANNOTATION , 2005 .

[8]  F. Casacuberta,et al.  Recent efforts in spoken language translation , 2008, IEEE Signal Processing Magazine.

[9]  Hermann Ney,et al.  Speech translation: coupling of recognition and translation , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[10]  David Chiang,et al.  A Hierarchical Phrase-Based Model for Statistical Machine Translation , 2005, ACL.

[11]  Li Deng,et al.  A novel decision function and the associated decision-feedback learning for speech translation , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Taro Watanabe,et al.  A Unified Approach in Speech-to-Speech Translation: Integrating Features of Speech recognition and Machine Translation , 2004, COLING.

[13]  Philipp Koehn,et al.  Statistical Significance Tests for Machine Translation Evaluation , 2004, EMNLP.

[14]  J. Treichler Signal processing: A view of the future, Part 2 [Exploratory DSP] , 2009, IEEE Signal Processing Magazine.

[15]  J. Treichler Signal processing: A view of the future, Part 1 [Exploratory DSP] , 2009, IEEE Signal Processing Magazine.

[16]  Richard Zens,et al.  Speech Translation by Confusion Network Decoding , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[17]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[18]  H. Ney,et al.  Phrase-based translation of speech recognizer word lattices using loglinear model combination , 2005, IEEE Workshop on Automatic Speech Recognition and Understanding, 2005..

[19]  Hermann Ney,et al.  Integrating Speech Recognition and Machine Translation: Where do We Stand? , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[20]  Dong Yu,et al.  An Integrative and Discriminative Technique for Spoken Utterance Classification , 2008, IEEE Transactions on Audio, Speech, and Language Processing.

[21]  Matthew G. Snover,et al.  A Study of Translation Edit Rate with Targeted Human Annotation , 2006, AMTA.

[22]  Richard M. Schwartz,et al.  Integrating Speech Recognition and Machine Translation , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.