A Comparative Evaluation of Research vs. Online MT Systems

This paper reports the MT evaluation experiments conducted at the end of the first year of the EU-funded CoSyne project for three language pairs, covering translations from German, Italian and Dutch into English. We present a comparative evaluation of the MT software developed within the project against four leading free web-based MT systems, using a range of state-of-the-art automatic evaluation metrics. We also describe the news-domain data sets that were created for training and for this evaluation exercise, which are available to the research community. The evaluation results for the news domain are very encouraging: the CoSyne MT software consistently outperforms the rule-based MT systems, and for translations from Italian and Dutch into English in particular, the scores given by several of the standard automatic evaluation metrics are not far from those obtained by well-established statistical online MT systems.
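To illustrate how such a comparative evaluation is typically scored, the minimal sketch below computes corpus-level automatic metrics for several systems' outputs against a shared set of reference translations. It assumes the sacrebleu library and uses hypothetical system names and file paths; it is an illustration of the general scoring procedure, not the exact tooling used in the paper.

```python
# Minimal sketch: scoring several MT systems' outputs against shared
# reference translations with corpus-level automatic metrics.
# Assumes the sacrebleu package; system names and file paths are hypothetical.
import sacrebleu


def read_lines(path):
    """Read one segment per line from a plain-text file."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# One reference translation per source sentence (news-domain test set).
references = read_lines("newstest.ref.en")

# Hypothetical output files from the systems being compared.
systems = {
    "cosyne": "outputs/cosyne.en",
    "online_smt": "outputs/online_smt.en",
    "online_rbmt": "outputs/online_rbmt.en",
}

for name, path in systems.items():
    hypotheses = read_lines(path)
    # Corpus-level BLEU and chrF against the single shared reference set.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    chrf = sacrebleu.corpus_chrf(hypotheses, [references])
    print(f"{name}: BLEU = {bleu.score:.2f}, chrF = {chrf.score:.2f}")
```

Scores computed this way are only comparable across systems when every system is evaluated on the same test set and the same references, which is the setup described in the paper.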
