Vision Transformer Architecture Search

Vision transformers (ViTs) inherit the success of transformers in NLP, but their structures have not been sufficiently investigated or optimized for visual tasks. One of the simplest solutions is to directly search for the optimal architecture with the neural architecture search (NAS) methods widely used for CNNs. However, we empirically find that this straightforward adaptation encounters catastrophic failures and makes training of the superformer frustratingly unstable. In this paper, we argue that since ViTs mainly operate on token embeddings with little inductive bias, the imbalance of channels across different architectures worsens the weight-sharing assumption and in turn causes the training instability. We therefore develop a new cyclic weight-sharing mechanism for the token embeddings of ViTs, which enables each channel to contribute more evenly to all candidate architectures. In addition, we propose identity shifting to alleviate the many-to-one issue in the superformer, and we leverage weak augmentation and regularization techniques for empirically steadier training. Based on these, our proposed method, ViTAS, achieves significant gains on both DeiT- and Twins-based ViTs. For example, with a budget of only 1.4G FLOPs, our searched architecture attains 3.3% higher ImageNet-1k accuracy than the baseline DeiT. With 3.0G FLOPs, our result achieves 82.0% accuracy on ImageNet-1k and 45.9% mAP on COCO2017, which is 2.4% higher than other ViTs.
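
To make the cyclic weight-sharing idea concrete, below is a minimal PyTorch sketch of one plausible reading of the mechanism: rather than every sampled sub-architecture slicing the first `width` channels of a shared token-embedding weight (which trains low-index channels far more often than high-index ones), each candidate takes a contiguous slice starting at a cyclically advancing offset that wraps around the channel pool, so every channel contributes to the candidate widths roughly equally. The class name `CyclicSharedEmbedding`, the offset schedule, and all dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class CyclicSharedEmbedding(nn.Module):
    """Sketch of cyclic weight sharing over embedding channels.

    Plain one-shot weight sharing slices weight[:width], so channel 0 is
    updated by every sampled candidate while channel max_width-1 is almost
    never touched. Rotating the slice spreads updates across the pool. The
    offset schedule (advance by `width` per call) is an assumption.
    """

    def __init__(self, in_dim: int, max_width: int):
        super().__init__()
        self.max_width = max_width
        # One shared weight pool; every candidate width reuses slices of it.
        self.weight = nn.Parameter(torch.randn(max_width, in_dim) * 0.02)
        self.offset = 0  # rotates once per sampled sub-architecture

    def forward(self, x: torch.Tensor, width: int) -> torch.Tensor:
        # Contiguous slice of `width` channels, wrapping around the pool.
        idx = torch.arange(self.offset, self.offset + width) % self.max_width
        self.offset = (self.offset + width) % self.max_width
        return x @ self.weight[idx].t()  # (batch, tokens, width)


embed = CyclicSharedEmbedding(in_dim=48, max_width=64)
tokens = torch.randn(2, 196, 48)  # (batch, 14x14 patches, patch features)
out_a = embed(tokens, width=32)   # uses channels 0..31
out_b = embed(tokens, width=32)   # uses channels 32..63
print(out_a.shape, out_b.shape)   # torch.Size([2, 196, 32]) twice
```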

[1] Xiaojun Chang et al. BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search. ICCV, 2021.

[2] Zhuowen Tu et al. Aggregated Residual Transformations for Deep Neural Networks. CVPR, 2017.

[3] Fei Wang et al. BCNet: Searching for Network Width with Bilaterally Coupled Network. CVPR, 2021.

[4] Tao Huang et al. GreedyNAS: Towards Fast One-Shot NAS With Greedy Supernet. CVPR, 2020.

[5] N. Codella et al. CvT: Introducing Convolutions to Vision Transformers. ICCV, 2021.

[6] Mingjie Sun et al. Rethinking the Value of Network Pruning. ICLR, 2019.

[7] Kaiming He et al. Focal Loss for Dense Object Detection. ICCV, 2017.

[8] Fei Wang et al. Prioritized Architecture Sampling with Monto-Carlo Tree Search. CVPR, 2021.

[9] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.

[10] Li Fei-Fei et al. ImageNet: A large-scale hierarchical image database. CVPR, 2009.

[11] Quanfu Fan et al. CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification. ICCV, 2021.

[12] Matthieu Cord et al. Training data-efficient image transformers & distillation through attention. ICML, 2021.

[13] Pieter Abbeel et al. Bottleneck Transformers for Visual Recognition. CVPR, 2021.

[14] Frank Hutter et al. Decoupled Weight Decay Regularization. ICLR, 2019.

[15] Yiming Yang et al. DARTS: Differentiable Architecture Search. ICLR, 2019.

[16] Jian Sun et al. Deep Residual Learning for Image Recognition. CVPR, 2016.

[17] Ilya Sutskever et al. Language Models are Unsupervised Multitask Learners. 2019.

[18] Kalyanmoy Deb et al. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 2002.

[19] Ross B. Girshick et al. Mask R-CNN. arXiv:1703.06870, 2017.

[20] Fei Wang et al. Locally Free Weight Sharing for Network Width Search. ICLR, 2021.

[21] Fei Wang et al. GreedyNASv2: Greedier Search with a Greedy Path Filter. arXiv, 2021.

[22] Tao Huang et al. Explicitly Learning Topology for Differentiable Neural Architecture Search. arXiv, 2020.

[23] Stephen P. Boyd et al. General Heuristics for Nonconvex Quadratically Constrained Quadratic Programming. arXiv:1703.07870, 2017.

[24] Bin Li et al. Deformable DETR: Deformable Transformers for End-to-End Object Detection. ICLR, 2021.

[25] Luc Van Gool et al. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 2010.

[26] Jiahui Yu et al. AutoSlim: Towards One-Shot Architecture Search for Channel Numbers. 2019.

[27] Yuning Jiang et al. Unified Perceptual Parsing for Scene Understanding. ECCV, 2018.

[28] Yuandong Tian et al. FP-NAS: Fast Probabilistic Neural Architecture Search. CVPR, 2021.

[29] Chao Xu et al. Reborn Filters: Pruning Convolutional Neural Networks with Limited Data. AAAI, 2020.

[30] Pietro Perona et al. Microsoft COCO: Common Objects in Context. ECCV, 2014.

[31] Ross B. Girshick et al. Focal Loss for Dense Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[32] Stephen P. Boyd et al. CVXPY: A Python-Embedded Modeling Language for Convex Optimization. Journal of Machine Learning Research, 2016.

[33] Georg Heigold et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR, 2021.

[34] Yuandong Tian et al. FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions. CVPR, 2020.

[35] Stephen Lin et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV, 2021.

[36] Xiangyu Zhang et al. Single Path One-Shot Neural Architecture Search with Uniform Sampling. ECCV, 2020.

[37] Enhua Wu et al. Transformer in Transformer. NeurIPS, 2021.

[38] Zhouchen Lin et al. Towards Improving the Consistency, Efficiency, and Flexibility of Differentiable Neural Architecture Search. CVPR, 2021.

[39] Bolei Zhou et al. Scene Parsing through ADE20K Dataset. CVPR, 2017.

[40] Minghao Chen et al. AutoFormer: Searching Transformers for Visual Recognition. ICCV, 2021.

[41] Zhiqiang Shen et al. Learning Efficient Convolutional Networks through Network Slimming. ICCV, 2017.

[42] Kai Chen et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv, 2019.

[43] Shuicheng Yan et al. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. arXiv, 2021.

[44] Mark Chen et al. Language Models are Few-Shot Learners. NeurIPS, 2020.

[45] Chunhua Shen et al. Twins: Revisiting the Design of Spatial Attention in Vision Transformers. NeurIPS, 2021.

[46] Phillip Isola et al. Contrastive Multiview Coding. ECCV, 2020.

[47] Fei Wang et al. K-shot NAS: Learnable Weight-Sharing for NAS with K-shot Supernets. ICML, 2021.

[48] Ling Shao et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. arXiv, 2021.

[49] Tao Xiang et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. CVPR, 2021.

[50] Fei Wang et al. ISTA-NAS: Efficient and Consistent Neural Architecture Search by Sparse Coding. NeurIPS, 2020.