Applying Automated Metrics to Speech Translation Dialogs

Over the past five years, the Defense Advanced Research Projects Agency (DARPA) has funded the development of speech translation systems for tactical applications. A key component of the research program has been extensive system evaluation, with the dual objectives of assessing overall progress and comparing systems with one another. This paper describes the methods used to obtain BLEU, TER, and METEOR scores for two-way English-Iraqi Arabic systems. We compare these scores with measures based on human judgments and demonstrate the effects of normalization operations on BLEU scores. Issues highlighted include the quality of the test data and the differential results of applying automated metrics to Arabic vs. English output.
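To illustrate the kind of normalization effect the abstract refers to, the sketch below computes a minimal smoothed sentence-level BLEU before and after a simple normalization pass (lowercasing and punctuation stripping). This is a generic illustration, not the paper's actual scoring pipeline: the `bleu` and `normalize` functions and the example sentences are assumptions for demonstration only.

```python
import math
import re
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: clipped n-gram precisions (add-one smoothed),
    geometric mean, times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        log_prec += math.log((clipped + 1) / (total + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec)

def normalize(text):
    """One plausible normalization step: lowercase, drop punctuation."""
    return re.sub(r"[^\w\s]", "", text.lower())

# Hypothetical system output vs. reference: identical except for case
# and punctuation, so normalization alone changes the score markedly.
hyp = "Where is the checkpoint?"
ref = "where is the checkpoint"
raw_score = bleu(hyp, ref)
norm_score = bleu(normalize(hyp), normalize(ref))
```

Here `norm_score` reaches the maximum while `raw_score` is penalized for the surface mismatches, showing why normalization choices must be reported alongside BLEU figures.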
