Long-Short Transformer: Efficient Transformers for Language and Vision

Transformers have achieved success in both the language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic time and memory complexity with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection, which models distant correlations, with a short-term attention that captures fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms state-of-the-art models on multiple tasks in the language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 with half the number of parameters of the previous method, while being faster and able to handle sequences 3× as long as its full-attention counterpart on the same hardware. On ImageNet, it obtains state-of-the-art results (e.g., 84.1% Top-1 accuracy when trained on 224×224 ImageNet-1K only), while being more scalable to high-resolution images. The models and source code will be released soon.
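To make the aggregation concrete, below is a minimal single-head sketch of the long-short attention idea in the bidirectional setting: a dynamic projection compresses the n keys/values into a small number of slots for the long-range branch, a sliding-window mask handles the short-term branch, and one LayerNorm per branch (the dual normalization) aligns the scales of the two key/value sets before a single joint softmax. Names such as `proj_len` (the number of projected slots) and `window` (the local radius), and the choice to normalize both keys and values with the same module per branch, are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of long-short attention (single head, bidirectional).
# Assumptions: proj_len/window are hypothetical hyperparameter names; the
# short branch materializes an n x n mask for clarity, whereas a linear-memory
# implementation would use segment-wise sliding windows.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LongShortAttention(nn.Module):
    def __init__(self, dim, proj_len=16, window=8):
        super().__init__()
        self.scale = dim ** -0.5
        self.window = window
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        # Dynamic projection: a length-n -> length-r map computed from the keys.
        self.to_dp = nn.Linear(dim, proj_len, bias=False)
        # Dual normalization: one LayerNorm per branch to align the scales of
        # the projected (long) and windowed (short) key/value sets.
        self.ln_long = nn.LayerNorm(dim)
        self.ln_short = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, n, dim)
        b, n, d = x.shape
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        # Long-range branch: compress n positions into proj_len slots using
        # data-dependent mixing weights computed from the keys.
        p = F.softmax(self.to_dp(k), dim=1)                     # (b, n, r)
        k_long = self.ln_long(torch.einsum('bnr,bnd->brd', p, k))
        v_long = self.ln_long(torch.einsum('bnr,bnd->brd', p, v))

        # Short-term branch: each query attends only to a local window.
        idx = torch.arange(n, device=x.device)
        local = (idx[None, :] - idx[:, None]).abs() <= self.window   # (n, n)
        k_short, v_short = self.ln_short(k), self.ln_short(v)

        # Joint attention: one softmax over the concatenated long and short keys.
        scores_long = torch.einsum('bnd,brd->bnr', q, k_long) * self.scale
        scores_short = torch.einsum('bnd,bmd->bnm', q, k_short) * self.scale
        scores_short = scores_short.masked_fill(~local, float('-inf'))
        attn = F.softmax(torch.cat([scores_long, scores_short], dim=-1), dim=-1)
        a_long, a_short = attn[..., :k_long.size(1)], attn[..., k_long.size(1):]
        return torch.einsum('bnr,brd->bnd', a_long, v_long) + \
               torch.einsum('bnm,bmd->bnd', a_short, v_short)
```

With r projected slots and a window of w, each query attends to r + O(w) keys instead of n, so the overall cost scales linearly in the sequence length rather than quadratically.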
