On the Evaluation of Machine Translation n-best Lists

The standard machine translation evaluation framework measures the single-best output of machine translation systems. There are, however, many situations where n-best lists are needed, yet there is no established way of evaluating them. This paper establishes a framework for n-best evaluation by outlining three questions one could ask when deciding what makes a ‘good’ n-best list, and proposing an evaluation measure for each. The first and principal contribution is an evaluation measure that characterizes the translation quality of an entire n-best list by asking whether many of the valid translations are placed near the top of the list. The second is a measure that uses gold translations with preference annotations to ask to what degree systems can produce ranked lists in preference order. The third is a measure that rewards partial matches, evaluating how close the items in an n-best list are to a large set of valid references. These three perspectives make clear that having access to many references can be useful when n-best evaluation is the goal.
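To make the first perspective concrete, the sketch below computes a rank-weighted match score over an n-best list: hypotheses that exactly match one of the valid references earn a gain that is discounted by rank, in the spirit of cumulated gain-based IR measures. The function name, the exact-match criterion, and the logarithmic discount are illustrative assumptions, not the specific measure proposed in the paper.

```python
import math

def rank_weighted_match_score(nbest, references, k=None):
    """Toy rank-weighted score for an n-best list (hypothetical, nDCG-style).

    A hypothesis that exactly matches a valid reference at rank i earns a
    gain of 1 / log2(i + 1), so valid translations near the top of the list
    are rewarded more. The score is normalized by the ideal case in which
    all attainable matches are packed at the top.
    """
    refs = set(references)
    k = k or len(nbest)

    # Discounted cumulative gain over the top-k hypotheses.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, hyp in enumerate(nbest[:k], start=1)
        if hyp in refs
    )

    # Ideal DCG: every attainable match placed at the highest possible ranks.
    n_attainable = min(len(refs), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, n_attainable + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: two of the three valid translations appear, at ranks 1 and 3.
nbest = ["la casa es roja", "la casa es rojo", "la casa roja", "el coche es rojo"]
refs = ["la casa es roja", "la casa roja", "es roja la casa"]
print(round(rank_weighted_match_score(nbest, refs), 3))
```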
