AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model

In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on a variety of tasks. In particular, we train a 20-billion-parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming the much larger 540B PaLM decoder-only model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs supported by the model (Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu) on the Flores-101 dataset. We also show that, in the zero-shot setting, AlexaTM 20B outperforms GPT-3 (175B) on the SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual tasks such as XNLI, XCOPA, PAWS-X, and XWinograd. Overall, our results present a compelling case for seq2seq models as a powerful alternative to decoder-only models for large-scale language model (LLM) training.
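The few-shot results above rely on in-context prompting: the labeled example(s) and the query are concatenated into the encoder input, and the decoder generates the answer for the unlabeled query. The snippet below is a minimal sketch of that 1-shot setup using the Hugging Face Transformers seq2seq API. The `google/mt5-base` checkpoint is only a small stand-in for AlexaTM 20B, and the prompt format and decoding settings shown here are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of 1-shot in-context prompting with a multilingual seq2seq model.
# Assumption: a stand-in checkpoint is used; substitute the actual AlexaTM 20B
# weights and prompt template if they are available in your environment.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-base"  # stand-in multilingual seq2seq model (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# One in-context example (the "1-shot") followed by the query goes to the encoder;
# the decoder then completes the translation for the final, unlabeled input.
prompt = (
    "Translate English to German.\n"
    "English: The weather is nice today. German: Das Wetter ist heute schön.\n"
    "English: Where is the train station? German:"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, num_beams=4)
# Note: the small stand-in model will not match the quality reported in the paper;
# this only illustrates the mechanics of seq2seq in-context prompting.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```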
