Random Feature Attention

Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function that models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences because its time and space complexity is quadratic in the sequence length. We propose RFA, a linear-time and linear-space attention mechanism that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves performance similar to or better than strong transformer baselines. In the machine translation experiment, RFA decodes twice as fast as a vanilla transformer. Compared to existing efficient transformer variants, RFA is competitive in both accuracy and efficiency on three long-text classification datasets. Our analysis shows that RFA’s efficiency gains are especially notable on long sequences, suggesting that RFA will be particularly useful in tasks that require working with large inputs, fast decoding speed, or low memory footprints.
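To make the approximation concrete, below is a minimal NumPy sketch of the non-causal (cross-attention) case, assuming the standard sin/cos random Fourier feature map for the Gaussian kernel and L2-normalized queries and keys; the names `random_feature_map`, `rfa_attention`, and `num_features` are illustrative, and this is not the authors' released implementation (which also covers the causal, gated variant).

```python
import numpy as np

def random_feature_map(x, W):
    """sin/cos random Fourier features approximating the Gaussian kernel:
    exp(-||a - b||^2 / 2) ~= phi(a) . phi(b), with rows of W drawn from N(0, I)."""
    proj = x @ W.T                                        # (..., D)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(W.shape[0])

def rfa_attention(Q, K, V, num_features=128, seed=0):
    """Non-causal random feature attention, linear in sequence length.
    Q: (n, d) queries, K: (m, d) keys, V: (m, d_v) values."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, Q.shape[-1]))
    # L2-normalize queries and keys so the exp(||x||^2 / 2) factors of the
    # softmax kernel are constant and cancel in the ratio below.
    Q = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + 1e-6)
    K = K / (np.linalg.norm(K, axis=-1, keepdims=True) + 1e-6)
    phi_q = random_feature_map(Q, W)                      # (n, 2D)
    phi_k = random_feature_map(K, W)                      # (m, 2D)
    S = phi_k.T @ V                                       # (2D, d_v): one pass over keys/values
    z = phi_k.sum(axis=0)                                 # (2D,)
    return (phi_q @ S) / ((phi_q @ z)[:, None] + 1e-6)    # (n, d_v) ~= softmax(Q K^T) V
```

Because the keys and values are summarized once into `S` and `z`, the per-query cost no longer depends on the number of keys, which is where the linear time and memory scaling comes from.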
