Living on the edge: productivity gain thresholds in machine translation evaluation metrics

This paper studies the minimum score at which machine translation (MT) evaluation metrics indicate productivity gains in a machine translation post-editing (MTPE) task. We ran an experiment in which 10 professional in-house translators from our company carried out a real translation task involving MTPE, translation from scratch, and fuzzy-match editing. We then analyzed the results and evaluated the MT output with traditional MT evaluation metrics such as BLEU and TER, as well as with the fuzzy score, the standard used in the translation industry to measure text similarity in translation memory (TM) matches. We report where the threshold between productivity gain and productivity loss lies and contrast it with past experiences in our company. We also compare the productivity of similar segments from the MTPE and TM match editing samples to gain further insight into their cognitive effort and pricing schemes.
