Diagnostic Evaluation of Machine Translation Systems Using Automatically Constructed Linguistic Check-Points

We present a diagnostic evaluation platform that provides multi-factored evaluation based on automatically constructed check-points. A check-point is a linguistically motivated unit (e.g., an ambiguous word, a noun phrase, a verb-object collocation, a prepositional phrase), which is pre-defined in a linguistic taxonomy. We present a method that automatically extracts check-points from parallel sentences. By means of check-points, our method can monitor how an MT system translates important linguistic phenomena, thereby providing diagnostic evaluation. The effectiveness of our approach for diagnostic evaluation is verified through experiments on various types of MT systems.
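As a minimal sketch of the idea, not the paper's actual implementation: suppose each extracted check-point is a (sentence id, taxonomy category, reference phrase) triple. One simple diagnostic score is per-category recall, the fraction of check-points whose reference translation appears in the system output for that sentence. The category names and toy data below are hypothetical.

```python
from collections import defaultdict

def checkpoint_recall(checkpoints, system_outputs):
    """Per-category recall: the fraction of check-points whose reference
    phrase appears verbatim in the system output for that sentence.
    (A simplification; real matching would use word alignment.)"""
    hit = defaultdict(int)
    total = defaultdict(int)
    for sent_id, category, ref_phrase in checkpoints:
        total[category] += 1
        if ref_phrase in system_outputs[sent_id]:
            hit[category] += 1
    return {c: hit[c] / total[c] for c in total}

# Toy check-points (hypothetical examples of taxonomy categories)
cps = [
    (0, "noun phrase", "the red car"),
    (0, "prepositional phrase", "in the garage"),
    (1, "verb-object collocation", "make a decision"),
]
outs = {
    0: "he parked the red car near the garage",
    1: "they made a decision quickly",
}
scores = checkpoint_recall(cps, outs)
```

A per-category breakdown like this is what distinguishes diagnostic evaluation from a single aggregate score such as BLEU: it shows, for instance, that a system handles noun phrases well but misses prepositional phrases.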