Few-shot Learning with Multilingual Language Models

Large-scale autoregressive language models such as GPT-3 are few-shot learners that can perform a wide range of language tasks without fine-tuning. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities on a wide range of tasks. Our largest model, with 7.5 billion parameters, sets a new state of the art in few-shot learning on more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in 4-shot settings) and natural language inference (+5.4% in both 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 out of 182 translation directions with 32 training examples, while surpassing the official supervised baselines in 45 directions. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning on some tasks, while there is still room for improvement in surface-form robustness and adaptation to tasks that do not have a natural cloze form. Finally, we evaluate our models on social value tasks such as hate speech detection in five languages and find that they have limitations similar to comparably sized GPT-3 models.
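
To make the zero-/few-shot evaluation setting concrete, the sketch below scores candidate completions of a cloze-style prompt with an autoregressive language model and selects the highest-scoring one. This is a minimal illustration of the standard in-context evaluation recipe, not a verbatim reproduction of the paper's pipeline; the checkpoint identifier (`facebook/xglm-564M`) and the example prompt and candidates are assumptions made for illustration.

```python
# Minimal sketch of zero-shot cloze-style scoring with a causal LM.
# The model name and the example below are illustrative assumptions,
# not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/xglm-564M"  # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def sequence_log_prob(text: str) -> float:
    """Sum of token log-probabilities the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so that position t predicts token t+1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()


def pick_answer(prompt: str, candidates: list[str]) -> str:
    """Choose the candidate continuation with the highest log-probability."""
    return max(candidates, key=lambda c: sequence_log_prob(prompt + " " + c))


# Zero-shot example phrased as a cloze completion (illustrative only).
prompt = "The man broke his toe because"
candidates = ["he dropped a hammer on his foot.", "he got a hole in his sock."]
print(pick_answer(prompt, candidates))
```

A k-shot variant simply prepends k demonstration examples to the prompt before scoring; for cross-lingual in-context learning, those demonstrations can be written in a different language than the test example.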
