Refiner: Refining Self-attention for Vision Transformers

Vision Transformers (ViTs) have shown competitive accuracy in image classification compared with CNNs, yet they generally require much more data for model pre-training. Most recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs. Specifically, refiner explores attention expansion, which projects the multi-head attention maps into a higher-dimensional space to promote their diversity. Further, refiner applies convolutions to augment the local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are aggregated locally with learnable kernels and then globally aggregated via self-attention. Extensive experiments demonstrate that refiner works surprisingly well. Significantly, it enables ViTs to achieve 86% top-1 classification accuracy on ImageNet with only 81M parameters. Code is publicly available at https://github.com/zhoudaquan/Refiner_ViT.
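To make the two refinement steps concrete, below is a minimal PyTorch sketch of a refined self-attention block. The module name `RefinedAttention`, the head counts, the 1x1 expansion/reduction convolutions, and the depth-wise convolution over the attention maps are illustrative assumptions rather than the official Refiner_ViT implementation; the sketch only follows the description above: attention maps are first expanded along the head dimension, then locally convolved, and finally used to aggregate the values.

```python
import torch
import torch.nn as nn

class RefinedAttention(nn.Module):
    """Sketch of refined self-attention (assumed layout, not the official code):
    (1) attention expansion: project the multi-head attention maps to more heads;
    (2) convolutional refinement: augment local patterns of the attention maps;
    (3) reduce back to the original head count and aggregate the values."""

    def __init__(self, dim, num_heads=8, expanded_heads=16, kernel_size=3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Attention expansion: 1x1 conv over the head dimension of the attention maps.
        self.expand = nn.Conv2d(num_heads, expanded_heads, kernel_size=1)
        # Local refinement: depth-wise conv applied to each expanded attention map.
        self.dwconv = nn.Conv2d(expanded_heads, expanded_heads, kernel_size,
                                padding=kernel_size // 2, groups=expanded_heads)
        # Reduce back to the original number of heads before value aggregation.
        self.reduce = nn.Conv2d(expanded_heads, num_heads, kernel_size=1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        # Refine the attention maps: expand, convolve locally, reduce.
        attn = self.reduce(self.dwconv(self.expand(attn)))
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

A usage note: the block is a drop-in replacement for the standard multi-head self-attention in a ViT encoder layer, e.g. `RefinedAttention(dim=384, num_heads=6)(tokens)` for tokens of shape `(batch, num_tokens, 384)`. Whether the attention rows are re-normalized after the convolutional refinement is a design choice not specified here and is left out of this sketch.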
