FLatten Transformer: Vision Transformer using Focused Linear Attention

The quadratic computation complexity of self-attention has been a persistent challenge when applying Transformer models to vision tasks. Linear attention, on the other hand, offers a far more efficient alternative with linear complexity by approximating the Softmax operation with carefully designed mapping functions. However, current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead from the mapping functions. In this paper, we propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness. Specifically, we first analyze the factors contributing to the performance degradation of linear attention from two perspectives: focus ability and feature diversity. To overcome these limitations, we introduce a simple yet effective mapping function and an efficient rank restoration module that enhance the expressiveness of self-attention while maintaining low computation complexity. Extensive experiments show that our linear attention module is applicable to a variety of advanced vision Transformers and achieves consistently improved performance on multiple benchmarks. Code is available at https://github.com/LeapLabTHU/FLatten-Transformer.
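For intuition, below is a minimal PyTorch sketch of the two ideas the abstract describes: a norm-preserving "focused" mapping function, and the kernel-trick reordering that gives linear attention its linear complexity. The exact form of the mapping, the focusing power `p`, and the tensor layout are illustrative assumptions rather than the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def focused_map(x, p=3.0, eps=1e-6):
    # Focused mapping (sketch): keep each token's norm, but sharpen its
    # direction by raising the non-negative features to the power p and
    # renormalizing, so attention weights concentrate on similar tokens.
    x = F.relu(x)
    norm = x.norm(dim=-1, keepdim=True)
    x_p = x ** p
    return x_p / (x_p.norm(dim=-1, keepdim=True) + eps) * norm

def focused_linear_attention(q, k, v, p=3.0, eps=1e-6):
    # q, k, v: (batch, num_tokens, dim). Computing phi(Q) (phi(K)^T V)
    # costs O(N d^2) instead of the O(N^2 d) of Softmax attention, because
    # phi(K)^T V is a (dim x dim) summary that is independent of N.
    q, k = focused_map(q, p, eps), focused_map(k, p, eps)
    kv = torch.einsum('bnd,bne->bde', k, v)                        # K^T V summary
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)  # row normalizer
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)
```

The full module additionally applies the rank restoration step, a depth-wise convolution over V whose output is added to the attention output to recover the feature diversity lost by the low-rank linear attention map; the sketch above omits it for brevity.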
