Attention mechanisms in computer vision: A survey

Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight adjustment process based on features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multi-modal tasks and self-supervised learning. In this survey, we provide a comprehensive review of various attention mechanisms in computer vision and categorize them by approach, including channel attention, spatial attention, temporal attention and branch attention. A related repository, https://github.com/MenghaoGuo/Awesome-Vision-Attentions, collects related work. We also suggest future directions for attention mechanism research.
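
The idea of attention as a dynamic weight adjustment driven by the input's own features can be illustrated with a minimal sketch. The function below is a hypothetical, parameter-free simplification of channel attention, not a method taken from the survey: the name `channel_attention`, the list-of-lists feature layout and the use of a plain sigmoid gate without learned layers are all assumptions made for illustration.

```python
import math


def channel_attention(feature_map):
    """Reweight the channels of a feature map.

    feature_map is a list of C channels, each an H x W list of floats.
    This is an illustrative sketch: a trained module would normally pass
    the channel descriptors through learned layers before gating.
    """
    # Summarize each channel with its global average (one number per channel).
    descriptors = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Turn each descriptor into a weight in (0, 1) with a sigmoid gate.
    # The weights depend on the input, which is what makes them dynamic.
    weights = [1.0 / (1.0 + math.exp(-d)) for d in descriptors]
    # Scale every value in a channel by that channel's weight.
    reweighted = [
        [[w * v for v in row] for row in channel]
        for w, channel in zip(weights, feature_map)
    ]
    return weights, reweighted


# Two 2 x 2 channels: an all-zero channel and a strongly activated one.
features = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
weights, output = channel_attention(features)
```

In this toy input the strongly activated channel receives the larger weight, so a different image would yield different weights from the same code. Spatial, temporal and branch attention apply the same reweighting idea along other axes.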
