Improving language models by retrieving from trillions of tokens

Language modelling (LM) is an unsupervised task that consists of modelling the probability of text, usually by factorising it into conditional next-token predictions $p(x_1, \ldots, x_n) = \prod_i p(x_i \mid x_{<i})$. Neural networks have proven to be powerful language models, first in the form of recurrent architectures (Graves, 2013; Jozefowicz et al., 2016; Mikolov et al., 2010) and more recently in the form of Transformers (Vaswani et al., 2017), which use attention to contextualise the past. Large performance improvements have come from increasing the amount of data, training compute, or model parameters. In the last two years, Transformers have been scaled from the 100 million parameter models of seminal work to over a hundred billion parameters (Brown et al., 2020; Radford et al., 2019), leading to models that perform very well on a wide array of tasks in a zero-shot or few-shot formulation. Increasing model size predictably improves performance on a wide range of downstream tasks (Kaplan et al., 2020). The benefits of increasing the number of parameters come from two factors: additional computation at training and inference time, and increased memorisation of the training data.
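To make the factorisation concrete, the following minimal sketch scores a sequence by summing conditional next-token log-probabilities. It uses a toy character-level bigram model with add-one smoothing as a stand-in for a neural language model; the class name `BigramLM`, the training string, and the smoothing scheme are illustrative assumptions, not part of the paper.

```python
# A minimal sketch of the autoregressive factorisation
# log p(x_1, ..., x_n) = sum_i log p(x_i | x_<i),
# illustrated with a toy character-level bigram model (add-one smoothing).
# This is an assumption-laden example, not the model described in the paper.
import math
from collections import defaultdict


class BigramLM:
    """Toy next-token model: approximates p(x_i | x_<i) by p(x_i | x_{i-1})."""

    def __init__(self, corpus: str):
        self.vocab = sorted(set(corpus))
        self.counts = defaultdict(lambda: defaultdict(int))
        prev = "<bos>"  # beginning-of-sequence marker
        for ch in corpus:
            self.counts[prev][ch] += 1
            prev = ch

    def cond_prob(self, token: str, prev: str) -> float:
        # Add-one (Laplace) smoothing so unseen continuations get non-zero mass.
        total = sum(self.counts[prev].values())
        return (self.counts[prev][token] + 1) / (total + len(self.vocab))

    def log_prob(self, text: str) -> float:
        # Chain rule: the sequence log-probability is the sum of
        # conditional next-token log-probabilities.
        prev, logp = "<bos>", 0.0
        for ch in text:
            logp += math.log(self.cond_prob(ch, prev))
            prev = ch
        return logp


lm = BigramLM("language modelling by retrieving from trillions of tokens ")
print(lm.log_prob("tokens"))   # in-distribution text: higher log-probability
print(lm.log_prob("zqzqzq"))   # out-of-distribution text: lower log-probability
```

A neural language model replaces the count-based `cond_prob` with a learned network conditioned on the full prefix, but the chain-rule accumulation in `log_prob` is unchanged.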

[1] Edouard Grave et al. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering, 2020, EACL.

[2] Frank Hutter et al. Decoupled Weight Decay Regularization, 2017, ICLR.

[3] Sebastian Riedel et al. Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets, 2020, EACL.

[4] Lukáš Burget et al. Recurrent neural network based language model, 2010, INTERSPEECH.

[5] Yonghui Wu et al. Exploring the Limits of Language Modeling, 2016, arXiv.

[6] Mohammad Shoeybi et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, arXiv.

[7] Jason Weston et al. Retrieval Augmentation Reduces Hallucination in Conversation, 2021, EMNLP.

[8] Yoshua Bengio et al. A Neural Knowledge Language Model, 2016, arXiv.

[9] Emily M. Bender et al. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜, 2021, FAccT.

[10] Ian Goodfellow et al. Deep Learning with Differential Privacy, 2016, CCS.

[11] Satoshi Nakamura et al. Guiding Neural Machine Translation with Retrieved Translation Pieces, 2018, NAACL.

[12] Yiming Yang et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.

[13] Omer Levy et al. Generalization through Memorization: Nearest Neighbor Language Models, 2020, ICLR.

[14] Andrew McCallum et al. Energy and Policy Considerations for Deep Learning in NLP, 2019, ACL.

[15] Sanjiv Kumar et al. Accelerating Large-Scale Inference with Anisotropic Vector Quantization, 2019, ICML.

[16] Timnit Gebru et al. Lessons from archives: strategies for collecting sociocultural data in machine learning, 2019, FAT*.

[17] Ming-Wei Chang et al. Natural Questions: A Benchmark for Question Answering Research, 2019, TACL.

[18] Charles Foster et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling, 2020, arXiv.

[19] Alexei Baevski et al. Adaptive Input Representations for Neural Language Modeling, 2018, ICLR.

[20] Byron C. Wallace et al. Attention is not Explanation, 2019, NAACL.

[21] Fabio Petroni et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, 2020, NeurIPS.

[22] Yejin Choi et al. RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models, 2020, Findings of EMNLP.

[23] Ilya Sutskever et al. Language Models are Unsupervised Multitask Learners, 2019.

[24] Douglas Eck et al. Deduplicating Training Data Makes Language Models Better, 2021, arXiv.

[25] Michael I. Jordan et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[26] Colin Raffel et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[27] Dani Yogatama et al. End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering, 2021, NeurIPS.

[28] Sandro Pezzelle et al. The LAMBADA dataset: Word prediction requiring a broad discourse context, 2016, ACL.

[29] Alex Graves. Generating Sequences With Recurrent Neural Networks, 2013, arXiv.

[30] Jason Weston et al. Internet-Augmented Dialogue Generation, 2021, ACL.

[31] Olatunji Ruwase et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.

[32] Yonatan Belinkov et al. Interpretability and Analysis in Neural NLP, 2020, ACL.

[33] Richard Socher et al. Pointer Sentinel Mixture Models, 2016, ICLR.

[34] Rico Sennrich et al. Root Mean Square Layer Normalization, 2019, NeurIPS.

[35] Mark Chen et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[36] Ming-Wei Chang et al. Retrieval Augmented Language Model Pre-Training, 2020, ICML.

[37] Lukasz Kaiser et al. Attention Is All You Need, 2017, NIPS.

[38] W. Bruce Croft et al. LDA-based document models for ad-hoc retrieval, 2006, SIGIR.

[39] Alec Radford et al. Scaling Laws for Neural Language Models, 2020, arXiv.

[40] W. Bruce Croft et al. Guided Transformer: Leveraging Multiple External Sources for Representation Learning in Conversational Search, 2020, SIGIR.

[41] Phil Blunsom et al. Pitfalls of Static Language Modelling, 2021, arXiv.

[42] Thorsten Brants et al. Large Language Models in Machine Translation, 2007, EMNLP.

[43] Colin Raffel et al. Extracting Training Data from Large Language Models, 2020, USENIX Security Symposium.

[44] Po-Sen Huang et al. Ethical and social risks of harm from Language Models, 2021, arXiv.

[45] Hugo Zaragoza et al. The Probabilistic Relevance Framework: BM25 and Beyond, 2009, Found. Trends Inf. Retr.

[46] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[47] Taku Kudo et al. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, 2018, EMNLP.

[48] Ming-Wei Chang et al. Latent Retrieval for Weakly Supervised Open Domain Question Answering, 2019, ACL.

[49] Alberto Montresor et al. WikiLinkGraphs: A complete, longitudinal and multi-language dataset of the Wikipedia link networks, 2019, ICWSM.

[50] Danqi Chen et al. Dense Passage Retrieval for Open-Domain Question Answering, 2020, EMNLP.

[51] Nicolas Usunier et al. Improving Neural Language Models with a Continuous Cache, 2016, ICLR.

[52] Yong Wang et al. Search Engine Guided Neural Machine Translation, 2018, AAAI.

[53] Nicola De Cao et al. A Memory Efficient Baseline for Open Domain Question Answering, 2020, arXiv.

[54] Dani Yogatama et al. Adaptive Semiparametric Language Models, 2021, TACL.