论文信息 - Stand-Alone Self-Attention in Vision Models

Stand-Alone Self-Attention in Vision Models

Convolutions are a fundamental building block of modern computer vision systems. Recent approaches have argued for going beyond convolutions in order to capture long-range dependencies. These efforts focus on augmenting convolutional models with content-based interactions, such as self-attention and non-local means, to achieve gains on a number of vision tasks. The natural question that arises is whether attention can be a stand-alone primitive for vision models instead of serving as just an augmentation on top of convolutions. In developing and testing a pure self-attention vision model, we verify that self-attention can indeed be an effective stand-alone layer. A simple procedure of replacing all instances of spatial convolutions with a form of self-attention applied to ResNet model produces a fully self-attentional model that outperforms the baseline on ImageNet classification with 12% fewer FLOPS and 29% fewer parameters. On COCO object detection, a pure self-attention model matches the mAP of a baseline RetinaNet while having 39% fewer FLOPS and 34% fewer parameters. Detailed ablation studies demonstrate that self-attention is especially impactful when used in later layers. These results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox.

[1] Rafael C. González,et al. Local Determination of a Moving Contrast Edge , 1985, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2) , 1983 .

[3] Kunihiko Fukushima,et al. Neocognitron: A hierarchical neural network capable of visual pattern recognition , 1988, Neural Networks.

[4] Lawrence D. Jackel,et al. Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.

[5] Boris Polyak,et al. Acceleration of stochastic approximation by averaging , 1992 .

[6] William Bialek,et al. Statistics of Natural Images: Scaling in the Woods , 1993, NIPS.

[7] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[8] Eero P. Simoncelli,et al. Natural image statistics and neural representation. , 2001, Annual review of neuroscience.

[9] Kunihiko Fukushima,et al. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position , 1980, Biological Cybernetics.

[10] Jean-Michel Morel,et al. A non-local algorithm for image denoising , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[11] Alex Krizhevsky,et al. Learning Multiple Layers of Features from Tiny Images , 2009 .

[12] Fei-Fei Li,et al. ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[13] William J. Dally,et al. The GPU Computing Era , 2010, IEEE Micro.

[14] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[15] Geoffrey E. Hinton,et al. On the importance of initialization and momentum in deep learning , 2013, ICML.

[16] Christian Szegedy,et al. DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[18] Stéphane Mallat,et al. Rigid-Motion Scattering for Texture Classification , 2014, ArXiv.

[19] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[21] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[22] Yoshua Bengio,et al. Attention-Based Models for Speech Recognition , 2015, NIPS.

[23] Dumitru Erhan,et al. Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[25] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[26] Sergey Ioffe,et al. Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28] Quoc V. Le,et al. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29] Guigang Zhang,et al. Deep Learning , 2016, Int. J. Semantic Comput..

[30] Jian Sun,et al. Identity Mappings in Deep Residual Networks , 2016, ECCV.

[31] Jia Deng,et al. Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[32] George Kurian,et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.

[33] Heiga Zen,et al. WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[34] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[35] David A. Patterson,et al. In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[36] Bo Chen,et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications , 2017, ArXiv.

[37] Yann Dauphin,et al. Convolutional Sequence to Sequence Learning , 2017, ICML.

[38] François Chollet,et al. Xception: Deep Learning with Depthwise Separable Convolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39] Xi Chen,et al. PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications , 2017, ICLR.

[40] Quoc V. Le,et al. Neural Architecture Search with Reinforcement Learning , 2016, ICLR.

[41] Kaiming He,et al. Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[42] Frank Hutter,et al. SGDR: Stochastic Gradient Descent with Warm Restarts , 2016, ICLR.

[43] Kaiming He,et al. Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Kaiming He,et al. Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[45] Zhuowen Tu,et al. Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Dustin Tran,et al. Image Transformer , 2018, ICML.

[47] Dustin Tran,et al. Mesh-TensorFlow: Deep Learning for Supercomputers , 2018, NeurIPS.

[48] George Papandreou,et al. MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49] George Papandreou,et al. Searching for Efficient Multi-Scale Architectures for Dense Image Prediction , 2018, NeurIPS.

[50] Abhinav Gupta,et al. Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[51] Douglas Eck,et al. Music Transformer , 2018, 1809.04281.

[52] Tomasz Malisiewicz,et al. SuperPoint: Self-Supervised Interest Point Detection and Description , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[53] Shuicheng Yan,et al. A2-Nets: Double Attention Networks , 2018, NeurIPS.

[54] Geoffrey E. Hinton,et al. Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures , 2018, NeurIPS.

[55] Ashish Vaswani,et al. Self-Attention with Relative Position Representations , 2018, NAACL.

[56] Iasonas Kokkinos,et al. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[57] Gang Sun,et al. Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks , 2018, NeurIPS.

[58] Max Welling,et al. Spherical CNNs , 2018, ICLR.

[59] Vijay Vasudevan,et al. Learning Transferable Architectures for Scalable Image Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[60] Mark Sandler,et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[61] Gang Sun,et al. Squeeze-and-Excitation Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[62] Stephen Lin,et al. Local Relation Networks for Image Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[63] Stephen Lin,et al. An Empirical Study of Spatial Attention Mechanisms in Deep Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[64] Quoc V. Le,et al. Attention Augmented Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[65] Bo Chen,et al. MnasNet: Platform-Aware Neural Architecture Search for Mobile , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[66] Max Welling,et al. Gauge Equivariant Convolutional Networks and the Icosahedral CNN 1 , 2019 .

[67] Ilya Sutskever,et al. Language Models are Unsupervised Multitask Learners , 2019 .

[68] Quoc V. Le,et al. NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[69] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[70] Yann Dauphin,et al. Pay Less Attention with Lightweight and Dynamic Convolutions , 2019, ICLR.