Training Language Models with Memory Augmentation

Recent work has improved language models (LMs) remarkably by equipping them with a non-parametric memory component. However, most existing approaches only introduce memories at testing time or represent them using a separately trained encoder, resulting in suboptimal training of the language model. In this work, we present TRIME, a novel yet simple training approach designed for training LMs with memory augmentation. Our approach uses a training objective that directly takes in-batch examples as accessible memory. We also present new methods for memory construction and data batching, which are used for adapting to different sets of memories (local, long-term, and external memory) at testing time. We evaluate TRIME on multiple language modeling and machine translation benchmarks and show that it is able to achieve significant improvements across all the settings. Concretely, TRIME reduces the perplexity from 18.70 to 15.37 on WIKITEXT-103, by effectively leveraging a large memory set from the training corpus. Compared to standard LM training, TRIME adds negligible computational overhead and is compatible with different neural architectures, making it a versatile solution for training memory-augmented LMs.
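To make the flavor of such a training objective concrete, the sketch below shows one way an in-batch-memory objective could be written: the probability of the gold next token pools scores from its output embedding and from every in-batch memory context that was followed by that same token, normalized over the full vocabulary plus all memories. The function name trime_style_loss, the scaled dot-product similarity with temperature temp, and the tensor layout are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def trime_style_loss(hidden, targets, token_emb, mem_keys, mem_labels, temp=1.0):
    """Sketch of a memory-augmented LM objective using in-batch memories.

    hidden:     (N, d)  context representations at each prediction position
    targets:    (N,)    gold next-token ids
    token_emb:  (V, d)  output token embeddings
    mem_keys:   (M, d)  representations of in-batch memory contexts
    mem_labels: (M,)    the token that follows each memory context
    """
    # Scores against the output embedding matrix (standard softmax LM term).
    vocab_scores = hidden @ token_emb.t()                       # (N, V)
    # Scores against in-batch memories (assumed scaled dot product).
    mem_scores = hidden @ mem_keys.t() / temp                   # (N, M)

    # Denominator: log-sum-exp over all vocabulary and memory scores.
    all_scores = torch.cat([vocab_scores, mem_scores], dim=-1)  # (N, V + M)
    log_denom = torch.logsumexp(all_scores, dim=-1)             # (N,)

    # Numerator: the gold token's embedding score plus every memory
    # whose next token matches the gold token.
    gold_vocab = vocab_scores.gather(1, targets.unsqueeze(1))   # (N, 1)
    match = mem_labels.unsqueeze(0) == targets.unsqueeze(1)     # (N, M) bool
    masked_mem = mem_scores.masked_fill(~match, float("-inf"))  # keep matches only
    log_numer = torch.logsumexp(
        torch.cat([gold_vocab, masked_mem], dim=-1), dim=-1     # (N,)
    )

    # Negative log-likelihood; reduces to standard cross-entropy when M = 0.
    return (log_denom - log_numer).mean()
```

In practice, memories overlapping the current prediction position would be masked out, and at testing time the memory set could be swapped for local, long-term, or external memories without changing the model.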
