Meta-Learning Fast Weight Language Models

Dynamic evaluation of language models (LMs) adapts model parameters at test time using gradient information from previous tokens and substantially improves LM performance. However, it requires over 3x more compute than standard inference. We present Fast Weight Layers (FWLs), a neural component that provides the benefits of dynamic evaluation much more efficiently by expressing gradient updates as linear attention. A key improvement over dynamic evaluation is that FWLs can also be applied at training time, so the model learns to make good use of gradient updates. FWLs can easily be added on top of existing transformer models, require relatively little extra compute or memory to run, and significantly improve language modeling perplexity.
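
The abstract describes FWLs as expressing the gradient updates of dynamic evaluation as a form of linear attention layered on top of an existing transformer. As a rough illustration of that connection only, the sketch below accumulates key/value outer products causally, which is the standard fast-weight view of linear attention; the module name `FastWeightLayer`, the `elu+1` feature map, the head size, and the residual placement are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch (not the authors' implementation): a fast-weight layer whose per-token
# update is a key/value outer product accumulated causally, i.e. causal linear attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastWeightLayer(nn.Module):
    def __init__(self, d_model: int, d_head: int = 64):
        super().__init__()
        self.to_k = nn.Linear(d_model, d_head, bias=False)   # keys: where to write
        self.to_v = nn.Linear(d_model, d_head, bias=False)   # values: what to store
        self.to_q = nn.Linear(d_model, d_head, bias=False)   # queries: what to read
        self.out = nn.Linear(d_head, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) hidden states from a base transformer.
        k = F.elu(self.to_k(x)) + 1          # positive feature map (illustrative choice)
        q = F.elu(self.to_q(x)) + 1
        v = self.to_v(x)
        # Fast-weight matrix after t tokens: W_t = sum_{i<=t} v_i k_i^T, an outer-product
        # accumulator playing the role of the gradient update from previous tokens.
        kv = torch.einsum('btk,btv->btkv', k, v)     # per-token outer products
        W = torch.cumsum(kv, dim=1)                  # causal accumulation of updates
        read = torch.einsum('btk,btkv->btv', q, W)   # apply fast weights to queries
        return x + self.out(read)                    # residual: sits on top of the base model

# Usage: apply to hidden states of an existing LM.
h = torch.randn(2, 16, 512)          # (batch, seq, d_model)
fwl = FastWeightLayer(d_model=512)
print(fwl(h).shape)                  # torch.Size([2, 16, 512])
```

Because the same computation runs during training, the surrounding model can learn to produce keys and values whose accumulated updates are actually useful at test time, which is the meta-learning aspect the abstract highlights.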
