Sparse Attention with Linear Units

Recently, it has been argued that encoder-decoder models can be made more interpretable by replacing the softmax function in the attention with its sparse variants. In this work, we introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU, and show that sparsity naturally emerges from such a formulation. Training stability is achieved with layer normalization, combined with either a specialized initialization or an additional gating function. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. We apply ReLA to the Transformer and conduct experiments on five machine translation tasks. ReLA achieves translation performance comparable to several strong baselines, with training and decoding speed similar to that of vanilla attention. Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and that the induced cross-attention achieves better accuracy with respect to source-target word alignment than recent sparsified softmax-based models. Intriguingly, ReLA heads also learn to attend to nothing (i.e. to 'switch off') for some queries, which is not possible with sparsified softmax alternatives.
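To make the mechanism concrete, below is a minimal, single-head sketch of rectified linear attention in PyTorch. The class name RectifiedLinearAttention is hypothetical, nn.LayerNorm stands in for the RMS layer normalization used in the paper, and the gating variant, multi-head splitting, masking, and dropout are omitted; this is an illustrative sketch under those assumptions, not the reference implementation.

```python
import math
import torch
import torch.nn as nn


class RectifiedLinearAttention(nn.Module):
    """Minimal single-head sketch of ReLA: ReLU replaces softmax, and a layer
    normalization over the aggregated values stabilizes training."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # The paper uses RMSNorm (optionally with a gate); plain LayerNorm here for simplicity.
        self.out_norm = nn.LayerNorm(d_model)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, query, key, value):
        # query: (batch, tgt_len, d_model); key/value: (batch, src_len, d_model)
        q = self.q_proj(query)
        k = self.k_proj(key)
        v = self.v_proj(value)

        scores = torch.bmm(q, k.transpose(1, 2)) * self.scale  # (batch, tgt_len, src_len)
        weights = torch.relu(scores)                            # negative scores become exactly zero
        context = torch.bmm(weights, v)                         # sparse, unnormalized aggregation
        return self.out_norm(context)
```

Because ReLU sets all negative scores to exactly zero, a query whose scores are all non-positive aggregates an all-zero context vector, which is how a ReLA head can 'switch off' for that query.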
