Swin Transformer V2: Scaling Up Capacity and Resolution

We present techniques for scaling Swin Transformer [35] up to 3 billion parameters and making it capable of training with images of up to 1,536×1,536 resolution. By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 image classification, 63.1 / 54.4 box / mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification. Our techniques are generally applicable for scaling up vision models, which has not been as widely explored as the scaling of NLP language models, partly due to the following difficulties in training and application: 1) vision models often face instability issues at scale, and 2) many downstream vision tasks require high-resolution images or windows, and it is not clear how to effectively transfer models pre-trained at low resolutions to higher-resolution ones. GPU memory consumption is also a problem when the image resolution is high. To address these issues, we present several techniques, illustrated using Swin Transformer as a case study: 1) a post-normalization technique and a scaled cosine attention approach to improve the stability of large vision models; and 2) a log-spaced continuous position bias technique to effectively transfer models pre-trained at low-resolution images and windows to their higher-resolution counterparts. In addition, we share crucial implementation details that lead to significant savings in GPU memory consumption and thus make it feasible to train large vision models with regular GPUs. Using these techniques and self-supervised pre-training, we successfully train a strong 3 billion-parameter Swin Transformer model and effectively transfer it to various vision tasks involving high-resolution images or windows, achieving state-of-the-art accuracy on a variety of benchmarks. Code will be available at https://
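As an illustration of the stabilization and transfer techniques named above, below is a minimal sketch of a window attention module that combines scaled cosine attention with a log-spaced continuous position bias (log-CPB). It is not the authors' released implementation: the meta-network width (512), the temperature initialization and clamping, and the tensor layout are assumptions made for illustration.

```python
# A minimal sketch (not the paper's released code) of scaled cosine attention
# combined with a log-spaced continuous position bias (log-CPB). The meta-network
# width, the temperature initialization/clamping, and the tensor layout are
# illustrative assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledCosineWindowAttention(nn.Module):
    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable per-head temperature for the cosine similarity (clamped in forward).
        self.logit_scale = nn.Parameter(torch.log(10.0 * torch.ones(num_heads, 1, 1)))
        # Small "meta network" mapping log-spaced relative coordinates to per-head
        # biases; because it is continuous, it can transfer across window sizes.
        self.cpb_mlp = nn.Sequential(
            nn.Linear(2, 512), nn.ReLU(inplace=True), nn.Linear(512, num_heads)
        )
        self.register_buffer("log_rel_coords", self._log_relative_coords(window_size))

    @staticmethod
    def _log_relative_coords(window_size):
        wh, ww = window_size
        coords = torch.stack(torch.meshgrid(
            torch.arange(wh, dtype=torch.float32),
            torch.arange(ww, dtype=torch.float32), indexing="ij")).reshape(2, -1)
        rel = (coords[:, :, None] - coords[:, None, :]).permute(1, 2, 0)  # (N, N, 2)
        # Log-spaced coordinates: sign(d) * log(1 + |d|)
        return torch.sign(rel) * torch.log1p(rel.abs())

    def forward(self, x):
        # x: (num_windows * batch, N, C) with N = window_height * window_width tokens.
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, c // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)  # each (b, heads, n, head_dim)

        # Cosine similarity between queries and keys, scaled by a clamped temperature.
        attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        attn = attn * torch.clamp(self.logit_scale, max=math.log(100.0)).exp()

        # Continuous position bias generated from the log-spaced coordinates.
        bias = self.cpb_mlp(self.log_rel_coords).permute(2, 0, 1)  # (heads, N, N)
        attn = (attn + bias.unsqueeze(0)).softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```

A full Swin V2 block would additionally place layer normalization after the attention and MLP sub-modules (the post-normalization mentioned above); this sketch covers only the attention computation.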

[1] Benjamin Recht, et al. Do ImageNet Classifiers Generalize to ImageNet?, 2019, ICML.

[2] Luc Van Gool, et al. LocalViT: Bringing Locality to Vision Transformers, 2021, ArXiv.

[3] Shuicheng Yan, et al. VOLO: Vision Outlooker for Visual Recognition, 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4] Nenghai Yu, et al. CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Cheng He, et al. FaPN: Feature-aligned Pyramid Network for Dense Image Prediction, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[6] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.

[7] Stephen Lin, et al. Local Relation Networks for Image Recognition, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[8] Hongyang Chao, et al. Rethinking and Improving Relative Position Encoding for Vision Transformer, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[9] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[10] Li Fei-Fei, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.

[11] Yichen Wei, et al. Relation Networks for Object Detection, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12] Lu Yuan, et al. Dynamic Head: Unifying Object Detection Heads with Attentions, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Lu Yuan, et al. Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding, 2021, ArXiv.

[14] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Quoc V. Le, et al. CoAtNet: Marrying Convolution and Attention for All Data Sizes, 2021, NeurIPS.

[16] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[17] Chien-Yao Wang, et al. You Only Learn One Representation: Unified Network for Multiple Tasks, 2021, J. Inf. Sci. Eng.

[18] Geoffrey E. Hinton, et al. Layer Normalization, 2016, ArXiv.

[19] Kaiming He, et al. Group Normalization, 2018, ECCV.

[20] Matthieu Cord, et al. Training data-efficient image transformers & distillation through attention, 2020, ICML.

[21] Furu Wei, et al. BEiT: BERT Pre-Training of Image Transformers, 2021, ArXiv.

[22] Tingting Liang, et al. CBNet: A Composite Backbone Network Architecture for Object Detection, 2021, IEEE Transactions on Image Processing.

[23] Klaus-Robert Müller, et al. SchNet: A continuous-filter convolutional neural network for modeling quantum interactions, 2017, NIPS.

[24] Noam Shazeer, et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2021, ArXiv.

[25] Zheng Zhang, et al. A Closer Look at Local Aggregation Operators in Point Cloud Analysis, 2020, ECCV.

[26] Tianqi Chen, et al. Training Deep Nets with Sublinear Memory Cost, 2016, ArXiv.

[27] Raquel Urtasun, et al. Deep Parametric Continuous Convolutional Neural Networks, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28] Jianmin Bao, et al. SimMIM: A Simple Framework for Masked Image Modeling, 2021, ArXiv.

[29] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[30] Jianfeng Gao, et al. UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training, 2020, ICML.

[31] Quoc V. Le, et al. Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Trevor Darrell, et al. Early Convolutions Help Transformers See Better, 2021, NeurIPS.

[33] Shuicheng Yan, et al. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, 2021, ArXiv.

[34] Carlos Riquelme, et al. Scaling Vision with Sparse Mixture of Experts, 2021, NeurIPS.

[35] Jian Sun, et al. Objects365: A Large-Scale, High-Quality Dataset for Object Detection, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[36] Georg Heigold, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021, ICLR.

[37] Michael S. Ryoo, et al. TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?, 2021, ArXiv.

[38] Alexander G. Schwing, et al. Per-Pixel Classification is Not All You Need for Semantic Segmentation, 2021, NeurIPS.

[39] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[40] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[41] Stephen Lin, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[42] Tie-Yan Liu, et al. Rethinking Positional Encoding in Language Pre-training, 2020, ICLR.

[43] Yutong Lin, et al. Leveraging Batch Normalization for Vision Transformers, 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).

[44] Lu Yuan, et al. Focal Self-attention for Local-Global Interactions in Vision Transformers, 2021, ArXiv.

[45] Dumitru Erhan, et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Samuel L. Smith, et al. Characterizing signal propagation to close the performance gap in unnormalized ResNets, 2021, ICLR.

[47] Fabio Viola, et al. The Kinetics Human Action Video Dataset, 2017, ArXiv.

[48] Zilong Huang, et al. Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer, 2021, ArXiv.

[49] Xiang Bai, et al. End-to-End Semi-Supervised Object Detection with Soft Teacher, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[50] Olatunji Ruwase, et al. ZeRO: Memory optimizations Toward Training Trillion Parameter Models, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.

[51] Armand Joulin, et al. Self-supervised Pretraining of Visual Features in the Wild, 2021, ArXiv.

[52] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.

[53] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.

[54] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[55] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, ArXiv.

[56] Andrea Vedaldi, et al. Instance Normalization: The Missing Ingredient for Fast Stylization, 2016, ArXiv.

[57] Bolei Zhou, et al. Semantic Understanding of Scenes Through the ADE20K Dataset, 2016, International Journal of Computer Vision.

[58] Ling Shao, et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, 2021, ArXiv.

[59] Tao Xiang, et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[60] Lucas Beyer, et al. Big Transfer (BiT): General Visual Representation Learning, 2020, ECCV.

[61] Cordelia Schmid, et al. ViViT: A Video Vision Transformer, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[62] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[63] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[64] Chunhua Shen, et al. Twins: Revisiting the Design of Spatial Attention in Vision Transformers, 2021, NeurIPS.