Augmenting Convolutional networks with attention-based aggregation

We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling with an attention-based aggregation layer, akin to a single transformer block, that weights how the patches are involved in the classification decision. We plug this learned aggregation layer into a simple patch-based convolutional network parametrized by two parameters (width and depth). In contrast with a pyramidal design, this architecture family maintains the input patch resolution across all layers. It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption, as shown by our experiments on various computer vision tasks: object classification, image segmentation and detection.
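To make the aggregation step concrete, here is a minimal sketch of the idea in PyTorch: a single learned class token attends once over the patch feature map produced by the convolutional trunk, replacing global average pooling before the classifier. The module name, dimensions, and single-head choice are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch (PyTorch assumed): attention-based aggregation replacing
# global average pooling. Names and sizes are illustrative only.
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    def __init__(self, dim: int, num_heads: int = 1, num_classes: int = 1000):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # learned query token
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from the convolutional trunk
        B, C, H, W = x.shape
        patches = x.flatten(2).transpose(1, 2)           # (B, H*W, C) patch tokens
        patches = self.norm(patches)
        cls = self.cls_token.expand(B, -1, -1)           # (B, 1, C) query
        # The attention weights indicate how much each patch contributes
        # to the classification decision (and can be visualized as a map).
        pooled, attn_weights = self.attn(cls, patches, patches)
        return self.head(pooled.squeeze(1))


# Usage sketch: plug the pooling on top of any convolutional backbone's features.
if __name__ == "__main__":
    feats = torch.randn(2, 256, 14, 14)                  # dummy conv feature map
    pool = AttentionPooling(dim=256, num_classes=1000)
    logits = pool(feats)
    print(logits.shape)                                   # torch.Size([2, 1000])
```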
