Prompting PaLM for Translation: Assessing Strategies and Performance

Large language models (LLMs) that have been trained on multilingual but not parallel text exhibit a remarkable ability to translate between languages. We probe this ability in an in-depth study of the Pathways Language Model (PaLM), which has demonstrated the strongest machine translation (MT) performance among similarly trained LLMs to date. We investigate various strategies for choosing translation examples for few-shot prompting, concluding that example quality is the most important factor. Using optimized prompts, we revisit previous assessments of PaLM’s MT capabilities with more recent test sets, modern MT metrics, and human evaluation, and find that its performance, while impressive, still lags that of state-of-the-art supervised systems. We conclude by providing an analysis of PaLM’s MT output which reveals some interesting properties and prospects for future work.
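
To make the few-shot prompting setup concrete, the sketch below shows one common way such translation prompts are assembled: a handful of high-quality source/target demonstration pairs followed by the test sentence. This is a minimal illustration under assumed conventions, not the paper's actual implementation; the function name, template, and example pairs are hypothetical.

```python
# Minimal sketch of a few-shot translation prompt builder (hypothetical, not
# the paper's implementation). The "Source: ... Target: ..." template is one
# common convention for prompting LLMs to translate.

from typing import List, Tuple


def build_translation_prompt(
    examples: List[Tuple[str, str]],  # (source, target) demonstration pairs
    source_sentence: str,
    src_lang: str = "French",
    tgt_lang: str = "English",
) -> str:
    """Assemble a few-shot prompt: k demonstrations, then the test source."""
    lines = []
    for src, tgt in examples:
        lines.append(f"{src_lang}: {src}")
        lines.append(f"{tgt_lang}: {tgt}")
        lines.append("")  # blank line separates demonstrations
    lines.append(f"{src_lang}: {source_sentence}")
    lines.append(f"{tgt_lang}:")  # the model completes the translation here
    return "\n".join(lines)


# Example usage with two hand-picked, high-quality demonstration pairs.
demos = [
    ("Le chat dort sur le canapé.", "The cat is sleeping on the sofa."),
    ("Nous partirons demain matin.", "We will leave tomorrow morning."),
]
prompt = build_translation_prompt(demos, "Il pleut depuis ce matin.")
print(prompt)
```

Under this framing, the example-selection strategies studied in the paper amount to different ways of choosing the `examples` list (e.g., random pools versus curated high-quality pairs), with the prompt template held fixed.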
