Synthesizer: Rethinking Self-Attention in Transformer Models

The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism to the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is not that important after all. To this end, we propose \textsc{Synthesizer}, a model that learns synthetic attention weights without token-token interactions. Our experimental results show that \textsc{Synthesizer} is competitive with vanilla Transformer models across a range of tasks, including machine translation (EnDe, EnFr), language modeling (LM1B), abstractive summarization (CNN/DailyMail), dialogue generation (PersonaChat), and multi-task language understanding (GLUE, SuperGLUE).
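For intuition, below is a minimal PyTorch sketch (not the authors' implementation) of the two kinds of synthetic attention the abstract alludes to: a dense variant that predicts each token's attention row from that token alone, and a random variant whose alignment matrix is input-independent. Class names, the `max_len` parameter, and the single-head layout are illustrative assumptions.

```python
# Sketch only: synthetic attention without query-key dot products.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSynthesizerAttention(nn.Module):
    """Predicts an l x l alignment matrix from each token alone (no token-token interaction)."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_model)
        self.w2 = nn.Linear(d_model, max_len)   # maps each token to a row of attention logits
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), with seq_len <= max_len
        seq_len = x.size(1)
        logits = self.w2(F.relu(self.w1(x)))[:, :, :seq_len]  # (batch, seq_len, seq_len)
        attn = F.softmax(logits, dim=-1)
        return attn @ self.value(x)


class RandomSynthesizerAttention(nn.Module):
    """Uses a single learned (or fixed) alignment matrix shared across all inputs."""

    def __init__(self, d_model: int, max_len: int, trainable: bool = True):
        super().__init__()
        self.logits = nn.Parameter(torch.randn(max_len, max_len), requires_grad=trainable)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        attn = F.softmax(self.logits[:seq_len, :seq_len], dim=-1)  # input-independent weights
        return attn @ self.value(x)


# Usage: both layers map (batch, seq_len, d_model) -> (batch, seq_len, d_model)
x = torch.randn(2, 16, 64)
print(DenseSynthesizerAttention(64, 32)(x).shape)   # torch.Size([2, 16, 64])
print(RandomSynthesizerAttention(64, 32)(x).shape)  # torch.Size([2, 16, 64])
```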
