Synthesizer: Rethinking Self-Attention in Transformer Models

The dot product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot product-based self-attention mechanism to the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is not that important after all. To this end, we propose \textsc{Synthesizer}, a model that learns synthetic attention weights without token-token interactions. Our experimental results show that \textsc{Synthesizer} is competitive with vanilla Transformer models across a range of tasks, including machine translation (EnDe, EnFr), language modeling (LM1B), abstractive summarization (CNN/DailyMail), dialogue generation (PersonaChat), and multi-task language understanding (GLUE, SuperGLUE).
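For intuition, below is a minimal PyTorch sketch (not the authors' implementation) of the two kinds of synthetic attention the abstract alludes to: a dense variant that predicts each token's attention row from that token alone, and a random variant whose alignment matrix is input-independent. Class names, the `max_len` parameter, and the single-head layout are illustrative assumptions.

```python
# Sketch only: synthetic attention without query-key dot products.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSynthesizerAttention(nn.Module):
    """Predicts an l x l alignment matrix from each token alone (no token-token interaction)."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_model)
        self.w2 = nn.Linear(d_model, max_len)   # maps each token to a row of attention logits
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), with seq_len <= max_len
        seq_len = x.size(1)
        logits = self.w2(F.relu(self.w1(x)))[:, :, :seq_len]  # (batch, seq_len, seq_len)
        attn = F.softmax(logits, dim=-1)
        return attn @ self.value(x)


class RandomSynthesizerAttention(nn.Module):
    """Uses a single learned (or fixed) alignment matrix shared across all inputs."""

    def __init__(self, d_model: int, max_len: int, trainable: bool = True):
        super().__init__()
        self.logits = nn.Parameter(torch.randn(max_len, max_len), requires_grad=trainable)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        attn = F.softmax(self.logits[:seq_len, :seq_len], dim=-1)  # input-independent weights
        return attn @ self.value(x)


# Usage: both layers map (batch, seq_len, d_model) -> (batch, seq_len, d_model)
x = torch.randn(2, 16, 64)
print(DenseSynthesizerAttention(64, 32)(x).shape)   # torch.Size([2, 16, 64])
print(RandomSynthesizerAttention(64, 32)(x).shape)  # torch.Size([2, 16, 64])
```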
