Rethinking the Design Principles of Robust Vision Transformer

Recent advances in Vision Transformers (ViTs) have shown that self-attention-based networks, which benefit from the ability to model long-range dependencies, surpass traditional convolutional neural networks (CNNs) on most vision tasks. To further broaden their applicability to computer vision, many improved variants redesign the Transformer architecture to incorporate the strengths of CNNs, e.g., locality and translation invariance, for better performance. However, these methods consider only the standard accuracy or the computational cost of the model. In this paper, we rethink the design principles of ViTs from the perspective of robustness. We find that some design components greatly harm the robustness and generalization ability of ViTs, while others are beneficial. By combining the robust design components, we propose the Robust Vision Transformer (RVT), a new vision transformer with superior performance and strong robustness. We further propose two new plug-and-play techniques, position-aware attention rescaling and patch-wise augmentation, to train our RVT. Experimental results on ImageNet and six robustness benchmarks show the superior robustness and generalization ability of RVT compared with previous Transformers and state-of-the-art CNNs. Our RVT-S∗ also achieves Top-1 rank on multiple robustness leaderboards, including ImageNet-C and ImageNet-Sketch. The code will be available at https://github.com/vtddggg/Robust-Vision-Transformer.
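The abstract only names the two training techniques, so the sketches below are illustrative reconstructions, not the authors' code. First, a minimal PyTorch sketch of what position-aware attention rescaling could look like, assuming it multiplies the pre-softmax attention logits by a learnable per-head, position-pair weight matrix; the class name, shapes, and initialization are assumptions:

```python
import torch
import torch.nn as nn


class PositionAwareAttentionRescaling(nn.Module):
    # Hypothetical sketch: rescale the pre-softmax attention logits with a
    # learnable per-head, position-pair weight matrix. The class name,
    # shapes, and initialization are assumptions, not the paper's code.
    def __init__(self, dim, num_heads, num_tokens):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Initialized to ones so training starts from plain self-attention.
        self.pos_scale = nn.Parameter(torch.ones(num_heads, num_tokens, num_tokens))

    def forward(self, x):
        # x: (B, N, C); N must equal num_tokens for the rescaling matrix.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)
        attn = attn * self.pos_scale  # position-aware rescaling (assumed form)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Patch-wise augmentation, read literally, would sample an independent augmentation for each patch-aligned region instead of applying one global transform per image. A hedged sketch with an assumed patch size and an illustrative op list:

```python
import random

import torchvision.transforms as T


def patchwise_augment(img, patch=32):
    # Hypothetical sketch: apply an independently sampled augmentation to
    # each patch-aligned region of a PIL image. The op list and patch size
    # are illustrative assumptions, not the authors' recipe.
    ops = [
        T.ColorJitter(0.4, 0.4, 0.4),
        T.RandomGrayscale(p=1.0),
        T.GaussianBlur(kernel_size=3),
    ]
    w, h = img.size
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Clamp the box so edge patches stay inside the image.
            box = (left, top, min(left + patch, w), min(top + patch, h))
            region = img.crop(box)
            img.paste(random.choice(ops)(region), box)
    return img  # modified in place and returned for convenience
```

Because each region is transformed independently, the model sees patch tokens whose appearance statistics vary within one image, which plausibly matches the stated goal of improving robustness at the patch level.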
