Pour l’évaluation externe des systèmes de TA par des méthodes fondées sur la tâche [For an external evaluation of MT systems by task-based methods]

External methods for evaluating MT systems define various measures based on MT output and its use. While operational systems have long been evaluated mainly by task-based methods, recent evaluation campaigns use (sparingly) rather expensive subjective methods based on unreliable human judgments and (for the most part) methods based on reference translations. The latter cannot be applied during real use of a system, correlate less well with human judgments as quality increases, and are wholly unrealistic in that they force progress to be measured on fixed corpora, retranslated over and over, rather than on new texts to be translated for real needs. Numerous biases are also introduced by the desire to reduce costs, in particular the use of parallel corpora in the direction opposite to that of their production, and the use of monolingual rather than bilingual judges. We support these claims with an analysis of the history of MT evaluation, of the "mainstream" evaluation methods, and of some recent evaluation campaigns. We propose to abandon reference-based methods in external evaluations and to replace them with strictly task-based methods, reserving reference-based methods for internal evaluations.
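To make concrete what a "reference-based" measure is (and why it presupposes a fixed, pre-translated corpus rather than new texts translated for real needs), here is a minimal sketch in Python of two such measures: a word-level edit distance in the spirit of WER/TER, and a BLEU-style modified n-gram precision against a single reference. The function names and toy data are illustrative assumptions, not the exact formulas used in the campaigns discussed.

```python
from collections import Counter

def word_edit_distance(hyp: str, ref: str) -> int:
    """Levenshtein distance over tokens (insertions, deletions, substitutions)."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))          # distances for the empty hypothesis prefix
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (hw != rw)))    # substitution (0 if tokens match)
        prev = cur
    return prev[-1]

def ngram_precision(hyp: str, ref: str, n: int = 2) -> float:
    """BLEU-style modified n-gram precision against a single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hyp_counts, ref_counts = ngrams(hyp.split()), ngrams(ref.split())
    overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

if __name__ == "__main__":
    reference = "the cat is on the mat"        # the pre-existing reference translation
    hypothesis = "the cat sat on the mat"      # the MT output being scored
    print(word_edit_distance(hypothesis, reference))        # 1 (one substitution)
    print(round(ngram_precision(hypothesis, reference), 2)) # 0.6 (3 of 5 bigrams match)
```

Both scores can only be computed when a reference translation already exists, which is precisely the limitation the abstract points to: they measure proximity to one prior translation of a fixed corpus, whereas task-based measures assess whether users can accomplish their task with the output of the system in actual use.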
