SPEDE: Probabilistic Edit Distance Metrics for MT Evaluation

This paper describes Stanford University's submission to the Shared Evaluation Task of WMT 2012. Our proposed metric (SPEDE) computes probabilistic edit distance as predictions of translation quality. We learn weighted edit distance in a probabilistic finite state machine (pFSM) model, where state transitions correspond to edit operations. While standard edit distance models cannot capture long-distance word swapping or cross alignments, we rectify these shortcomings using a novel pushdown automaton extension of the pFSM model. Our models are trained in a regression framework, and can easily incorporate a rich set of linguistic features. Evaluated on two different prediction tasks across a diverse set of datasets, our methods achieve state-of-the-art correlation with human judgments.

[1]  Philipp Koehn,et al.  Findings of the 2009 Workshop on Statistical Machine Translation , 2009, WMT@EACL.

[2]  Andy Way,et al.  Evaluating machine translation with LFG dependencies , 2007, Machine Translation.

[3]  Yifan He,et al.  The DCU Dependency-Based Metric in WMT-MetricsMATR 2010 , 2010, WMT@ACL.

[4]  Francisco Casacuberta,et al.  Probabilistic finite-state machines - part I , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Yin Chen,et al.  A Re-examination on Features in Regression Based Approach to Automatic MT Evaluation , 2008, ACL.

[6]  Ralph Weischedel,et al.  A STUDY OF TRANSLATION ERROR RATE WITH TARGETED HUMAN ANNOTATION , 2005 .

[7]  Philipp Koehn,et al.  Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation , 2010, WMT@ACL.

[8]  Alon Lavie,et al.  Extending the METEOR Machine Translation Evaluation Metric to the Phrase Level , 2010, NAACL.

[9]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[10]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[11]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[12]  Ronald L. Rivest,et al.  Introduction to Algorithms , 1990 .

[13]  Jason Eisner,et al.  Parameter Estimation for Probabilistic Finite-State Transducers , 2002, ACL.

[14]  Ely Porat,et al.  Approximate Swap and Mismatch Edit Distance , 2007, SPIRE.

[15]  George R. Doddington,et al.  Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics , 2002 .

[16]  Philipp Koehn,et al.  Statistical Significance Tests for Machine Translation Evaluation , 2004, EMNLP.

[17]  Ronald L. Rivest,et al.  Introduction to Algorithms, Second Edition , 2001 .

[18]  Liang Zhou,et al.  Re-evaluating Machine Translation Results with Paraphrase Support , 2006, EMNLP.

[19]  George A. Miller,et al.  Introduction to WordNet: An On-line Lexical Database , 1990 .

[20]  Yaser Al-Onaizan,et al.  Translation with Finite-State Devices , 1998, AMTA.

[21]  Rebecca Hwa,et al.  Regression for Sentence-Level MT Evaluation with Pseudo References , 2007, ACL.

[22]  Daniel Jurafsky,et al.  Robust Machine Translation Evaluation with Entailment Features , 2009, ACL.

[23]  Jorge Nocedal,et al.  On the limited memory BFGS method for large scale optimization , 1989, Math. Program..

[24]  Nitin Madnani,et al.  Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric , 2009, WMT@EACL.

[25]  Rebecca Hwa,et al.  A Re-examination of Machine Learning Approaches for Sentence-Level MT Evaluation , 2007, ACL.

[26]  Shankar Kumar,et al.  A Weighted Finite State Transducer Implementation of the Alignment Template Model for Statistical Machine Translation , 2003, NAACL.

[27]  Daniel Marcu,et al.  How To Select An Answer String , 2008 .

[28]  Philipp Koehn,et al.  Findings of the 2011 Workshop on Statistical Machine Translation , 2011, WMT@EMNLP.

[29]  Alexandra Birch,et al.  A Quantitative Analysis of Reordering Phenomena , 2009, WMT@EACL.

[30]  Javier Esparza,et al.  Quantitative analysis of probabilistic pushdown automata: expectations and variances , 2005, 20th Annual IEEE Symposium on Logic in Computer Science (LICS' 05).

[31]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[32]  Matthew G. Snover,et al.  A Study of Translation Edit Rate with Targeted Human Annotation , 2006, AMTA.

[33]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[34]  Ding Liu,et al.  Syntactic Features for Evaluation of Machine Translation , 2005, IEEvaluation@ACL.