Fastformer: Additive Attention is All You Need

Transformer is a powerful model for text understanding. However, it is inefficient due to its quadratic complexity with respect to input sequence length. Although there are many methods for Transformer acceleration, they are still either inefficient on long sequences or not effective enough. In this paper, we propose Fastformer, an efficient Transformer model based on additive attention. In Fastformer, instead of modeling the pair-wise interactions between tokens, we first use an additive attention mechanism to model global contexts, and then further transform each token representation based on its interaction with the global context representations. In this way, Fastformer can achieve effective context modeling with linear complexity. Extensive experiments on five datasets show that Fastformer is much more efficient than many existing Transformer models while achieving comparable or even better long-text modeling performance.
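The mechanism sketched in the abstract can be illustrated with a minimal, single-head PyTorch layer. This is only a sketch based on the high-level description above; the projection sizes, the learnable scoring vectors, and the residual connection back to the query are assumptions rather than the paper's exact formulation. Additive attention pools the queries into a single global query, which modulates the keys element-wise; a second additive attention pools those interactions into a global key, which modulates the values, so every step is linear in sequence length.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAttentionLayer(nn.Module):
    """Single-head sketch of a Fastformer-style layer with linear complexity."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        # Learnable scoring vectors for the two additive-attention poolings.
        self.query_score = nn.Linear(dim, 1)
        self.key_score = nn.Linear(dim, 1)
        self.out = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        # Pool all queries into one global query vector (cost linear in seq_len).
        alpha = F.softmax(self.query_score(q) * self.scale, dim=1)  # (batch, seq, 1)
        global_q = (alpha * q).sum(dim=1, keepdim=True)             # (batch, 1, dim)

        # Interact the global query with each key element-wise.
        p = global_q * k

        # Pool the interactions into one global key vector.
        beta = F.softmax(self.key_score(p) * self.scale, dim=1)
        global_k = (beta * p).sum(dim=1, keepdim=True)

        # Transform each value via its interaction with the global key,
        # then project and add a residual connection to the query (assumed here).
        u = global_k * v
        return self.out(u) + q


# Usage: one layer applied to a batch of token embeddings.
layer = AdditiveAttentionLayer(dim=64)
tokens = torch.randn(2, 128, 64)
out = layer(tokens)  # (2, 128, 64); cost grows linearly with sequence length
```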
