Large Language Models Effectively Leverage Document-level Context for Literary Translation, but Critical Errors Persist

Large language models (LLMs) are competitive with the state of the art on a wide range of sentence-level translation datasets. However, their ability to translate paragraphs and documents remains unexplored because evaluation in these settings is costly and difficult. We show through a rigorous human evaluation that asking the GPT-3.5 (text-davinci-003) LLM to translate an entire literary paragraph (e.g., from a novel) at once results in higher-quality translations than standard sentence-by-sentence translation across 18 linguistically diverse language pairs (e.g., translating into and out of Japanese, Polish, and English). Our evaluation, which took approximately 350 hours of annotation and analysis effort, is conducted by hiring translators fluent in both the source and target language and asking them to provide both span-level error annotations and preference judgments of which system's translations are better. We observe that discourse-level LLM translators commit fewer mistranslations, grammar errors, and stylistic inconsistencies than sentence-level approaches. That said, critical errors still abound, including occasional content omissions, and a human translator's intervention remains necessary to ensure that the author's voice remains intact. We publicly release our dataset and error annotations to spur future research on the evaluation of document-level literary translation.
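The contrast between the two setups can be sketched as prompt construction: the sentence-level baseline issues one independent prompt per sentence, while the paragraph-level approach puts the whole paragraph into a single prompt so the model can use discourse context (pronoun reference, register, lexical cohesion). This is a minimal illustrative sketch; the prompt wording and function names are assumptions, not the paper's exact templates.

```python
def sentence_prompts(sentences, src="Japanese", tgt="English"):
    """Sentence-by-sentence baseline: one prompt per sentence,
    so the model never sees the surrounding context."""
    return [
        f"Translate the following {src} sentence into {tgt}:\n"
        f"{s}\nTranslation:"
        for s in sentences
    ]


def paragraph_prompt(sentences, src="Japanese", tgt="English"):
    """Paragraph-level setup: a single prompt over the whole
    paragraph, exposing document-level context to the model."""
    paragraph = " ".join(sentences)
    return (
        f"Translate the following {src} paragraph into {tgt}:\n"
        f"{paragraph}\nTranslation:"
    )


sents = ["彼は駅に着いた。", "外は雨だった。"]
print(sentence_prompts(sents))   # two independent prompts
print(paragraph_prompt(sents))   # one prompt containing both sentences
```

Each prompt would then be sent to the completion API of the model under evaluation; only the prompting granularity differs between the two conditions.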
