KVT: k-NN Attention for Boosting Vision Transformers

Convolutional Neural Networks (CNNs) have dominated computer vision for years, owing to their ability to capture locality and translation invariance. Recently, many vision transformer architectures have been proposed, showing promising performance. A key component in vision transformers is the fully-connected self-attention, which is more powerful than convolution at modelling long-range dependencies. However, since dense self-attention uses all image patches (tokens) to compute the attention matrix, it may neglect the locality of image patches and involve noisy tokens (e.g., cluttered background and occlusion), leading to slow training and potential degradation of performance. To address these problems, we propose a sparse attention scheme, dubbed k-NN attention, for boosting vision transformers. Specifically, instead of involving all the tokens in the attention matrix calculation, we select only the top-k most similar tokens from the keys for each query to compute the attention map. The proposed k-NN attention naturally inherits the local inductive bias of CNNs without introducing convolutional operations, as nearby tokens tend to be more similar to each other than to distant ones. In addition, k-NN attention allows for the exploration of long-range correlations while filtering out irrelevant tokens by choosing the most similar tokens from the entire image. Despite its simplicity, we verify, both theoretically and empirically, that k-NN attention is powerful in distilling noise from input tokens and in speeding up training. Extensive experiments with ten different vision transformer architectures verify that the proposed k-NN attention works with any existing transformer architecture and improves its prediction performance.
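The selection step is simple enough to sketch. Below is a minimal PyTorch illustration of top-k attention as described above: scores are computed as in standard scaled dot-product attention, then all but the k largest scores in each query row are masked out before the softmax. The function name `knn_attention`, the tensor shapes, and the choice of `top_k` are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def knn_attention(q, k, v, top_k):
    """Hypothetical sketch of k-NN attention.

    q, k, v: tensors of shape (batch, heads, tokens, head_dim).
    top_k:   number of keys each query is allowed to attend to.
    """
    scale = q.size(-1) ** -0.5
    # Full scaled dot-product scores, as in vanilla self-attention.
    scores = (q @ k.transpose(-2, -1)) * scale          # (b, h, n, n)
    # For each query row, keep only the top-k key scores; every other
    # entry is set to -inf so it receives zero weight after the softmax.
    topk_vals, topk_idx = scores.topk(top_k, dim=-1)
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)
    attn = F.softmax(masked, dim=-1)
    return attn @ v                                     # (b, h, n, d)

# Example usage (shapes are assumptions): 14x14 patch tokens,
# keeping half of the keys per query.
q = k = v = torch.randn(2, 4, 196, 64)
out = knn_attention(q, k, v, top_k=98)
print(out.shape)  # torch.Size([2, 4, 196, 64])
```

Note that this sketch still computes the full n-by-n score matrix and sparsifies only the softmax; masking after scoring is the straightforward way to realize per-query top-k selection while keeping the interface of standard attention, which is consistent with the claim that k-NN attention can drop into existing transformer architectures.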
