MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation

Vision transformers (ViTs) have recently demonstrated great success in various computer vision tasks, motivating tremendously increased interest in deploying them in many real-world IoT applications. However, powerful ViTs are often too computationally expensive to fit onto real-world resource-constrained devices, due to (1) their complexity, which grows quadratically with the number of input tokens, and (2) their overparameterized self-attention heads and model depth. In parallel, different images are of varied complexity and their different regions can contain various levels of visual information; e.g., a sky background is not as informative as a foreground object in object classification tasks. This indicates that treating all regions/tokens equally in terms of model complexity is unnecessary, yet such opportunities for trimming down ViTs’ complexity have not been fully explored. To this end, we propose a Multi-grained Input-Adaptive Vision TransFormer framework dubbed MIA-Former that can input-adaptively adjust the structure of ViTs at three coarse-to-fine granularities (i.e., model depth and the number of attention heads/tokens). In particular, our MIA-Former adopts a low-cost network trained with a hybrid supervised and reinforcement-learning method to skip unnecessary layers, heads, and tokens in an input-adaptive manner, reducing the overall computational cost. Furthermore, an interesting side effect of MIA-Former is that its resulting ViTs are naturally equipped with improved robustness against adversarial attacks over their static counterparts: MIA-Former’s multi-grained dynamic control improves model diversity, similar to the effect of an ensemble, and thus increases the difficulty of crafting adversarial attacks that succeed against all its sub-models. Extensive experiments and ablation studies validate that the proposed MIA-Former framework can (1) effectively allocate computation budgets adaptively to the difficulty of input images, achieving state-of-the-art (SOTA) accuracy-efficiency trade-offs, e.g., 20% computation savings with the same or even higher accuracy compared with SOTA dynamic transformer models, and (2) boost ViTs’ robust accuracy under various adversarial attacks over their vanilla counterparts by 2.4% and 3.0%, respectively.
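To make the multi-grained skipping concrete, below is a minimal PyTorch-style sketch of the kind of low-cost controller the abstract describes: a lightweight network that looks at the input once and emits per-input keep/skip decisions for layers, attention heads, and tokens, which the ViT backbone could then use to mask out computation. All module names, shapes, and hyperparameters here are illustrative assumptions, not the authors’ implementation.

```python
# Sketch only (not the MIA-Former code): a cheap controller that produces
# layer-, head-, and token-level keep masks from mean-pooled patch embeddings.
import torch
import torch.nn as nn


class MultiGrainedController(nn.Module):
    def __init__(self, embed_dim=192, num_layers=12, num_heads=3, num_tokens=197):
        super().__init__()
        # Low-cost feature extractor over the pooled patch embeddings.
        self.encoder = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU())
        # One prediction head per granularity.
        self.layer_head = nn.Linear(64, num_layers)
        self.head_head = nn.Linear(64, num_layers * num_heads)
        self.token_head = nn.Linear(64, num_tokens)
        self.num_layers, self.num_heads = num_layers, num_heads

    def forward(self, patch_embeddings, tau=1.0):
        # patch_embeddings: (batch, num_tokens, embed_dim)
        summary = self.encoder(patch_embeddings.mean(dim=1))            # (B, 64)
        layer_logits = self.layer_head(summary)                          # (B, L)
        head_logits = self.head_head(summary).view(
            -1, self.num_layers, self.num_heads)                         # (B, L, H)
        token_logits = self.token_head(summary)                          # (B, N)
        if self.training:
            # Soft (differentiable) decisions for the supervised part of the
            # hybrid training; an RL objective could act on the same logits.
            keep = lambda logits: torch.sigmoid(logits / tau)
        else:
            # Hard skip decisions at inference time.
            keep = lambda logits: (logits > 0).float()
        return keep(layer_logits), keep(head_logits), keep(token_logits)


if __name__ == "__main__":
    controller = MultiGrainedController()
    x = torch.randn(2, 197, 192)  # dummy patch embeddings (DeiT-Tiny-like shapes)
    layer_mask, head_mask, token_mask = controller(x)
    print(layer_mask.shape, head_mask.shape, token_mask.shape)
```

In such a design, the backbone would multiply attention outputs and token features by these masks (or gather only the kept entries), so that easy inputs trigger more skipping and harder inputs retain more of the full model.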
