Paper information - Is my Judge a good One?

This paper aims to measure the reliability of judges in MT evaluation. The scope is two evaluation campaigns from the CESTA project, during which human evaluations were carried out on fluency and adequacy criteria for English-to-French documents. Our objectives were threefold: to observe inter-judge agreement, to observe intra-judge agreement, and to study the influence of the evaluation design implemented specifically for the campaigns. A web interface was developed to support the human judgments and store the results, but some design changes were made between the first and the second campaign. Given the low agreement observed, the judges' behaviour was analysed in that specific context. We also asked several judges to repeat their own evaluations a few times after the first judgments made during the official evaluation campaigns. Although the judges did not seem to agree fully at first sight, a less strict comparison led to strong agreement. Furthermore, the evolution of the design during the project seems to have been a source of the difficulties that judges encountered in keeping the same interpretation of quality.
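The contrast between strict and "less strict" agreement can be illustrated with a minimal sketch. The abstract does not say which agreement measures or which tolerance were used, so everything here is an assumption: the 1-5 scale, the one-point tolerance, the use of Cohen's kappa, the function names, and the made-up scores.

```python
from collections import Counter


def observed_agreement(a, b, tolerance=0):
    """Share of items on which two score lists differ by at most `tolerance`."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa for two score lists: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = observed_agreement(a, b)  # strict (exact-match) agreement
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability that both lists pick the same category
    # if each followed its own marginal distribution independently.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1:
        return 1.0  # both lists use a single identical category
    return (p_o - p_e) / (1 - p_e)


# Hypothetical scores on an assumed 1-5 scale (not data from the paper).
judge_1 = [5, 4, 3, 2, 4, 1, 3, 5]
judge_2 = [4, 4, 2, 2, 5, 1, 4, 5]

strict = observed_agreement(judge_1, judge_2)                # exact match only
relaxed = observed_agreement(judge_1, judge_2, tolerance=1)  # within one point
kappa = cohen_kappa(judge_1, judge_2)

print(f"strict={strict:.2f} relaxed={relaxed:.2f} kappa={kappa:.3f}")
```

The same functions would apply to intra-judge agreement by passing one judge's first scores and that judge's repeated scores as the two lists.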

Olivier Hamon
