MCMAE: Masked Convolution Meets Masked Autoencoders

Vision Transformers (ViT) have become widely adopted architectures for various vision tasks. Masked auto-encoding [2, 1, 28, 55] for feature pretraining and multi-scale hybrid convolution-transformer architectures [12, 21, 49, 34, 57] can further unleash the potential of ViT, leading to state-of-the-art performance on image classification, object detection, and semantic segmentation. In this paper, our MCMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme. However, directly applying the original masking strategy leads to heavy computational cost and a pretraining-finetuning discrepancy: unlike a plain ViT, the convolutional stages cannot simply drop masked tokens, and their overlapping receptive fields leak information across mask boundaries. To tackle this issue, we adopt masked convolution to prevent such information leakage in the convolution blocks, and propose a simple block-wise masking strategy to ensure computational efficiency. We also propose to directly supervise the multi-scale features of the encoder, which further strengthens its multi-scale representations. MCMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection, MCMAE-Base finetuned for only 25 epochs surpasses MAE-Base finetuned for 100 epochs by 2.9% box AP and 2.2% mask AP, respectively. Code and pretrained models are available at https://github.com/Alpha-VL/ConvMAE .
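
To make the masked convolution concrete, below is a minimal PyTorch sketch, not the released ConvMAE implementation; the module name `MaskedConvBlock`, the depthwise design, and the kernel size are illustrative assumptions. The binary mask zeroes masked positions before the convolution and re-masks the output, so visible positions never receive features computed from masked patches.

```python
import torch
import torch.nn as nn


class MaskedConvBlock(nn.Module):
    """Minimal sketch of a masked convolution block (illustrative, not the
    released ConvMAE code). Masked positions are zeroed before the depthwise
    convolution and the output is re-masked, so features at visible positions
    never absorb information derived from masked patches."""

    def __init__(self, dim: int, kernel_size: int = 5):
        super().__init__()
        # Depthwise convolution, a common choice in hybrid conv-transformer
        # stages; the kernel size here is an assumption for illustration.
        self.conv = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x:    (B, C, H, W) feature map of a convolutional stage
        # mask: (B, 1, H, W) binary mask, 1 = visible, 0 = masked
        x = x * mask          # hide masked patches from the convolution input
        x = self.conv(x)
        return x * mask       # re-mask: kernels overlapping the boundary wrote here
```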

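The block-wise masking strategy can be sketched in the same spirit: sample one random mask at the coarse transformer-stage token grid, then upsample it to the finer convolutional stages so every stage hides the same image regions. The function name, grid size, mask ratio, and stage scale factors below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def block_wise_masks(batch: int, grid: int = 14, mask_ratio: float = 0.75,
                     stage_scales: tuple = (4, 2, 1)) -> list:
    """Illustrative sketch of block-wise masking: draw one random mask at the
    coarse transformer-stage grid, then nearest-neighbour upsample it so the
    higher-resolution convolutional stages mask the same image regions."""
    num_tokens = grid * grid
    num_visible = int(num_tokens * (1.0 - mask_ratio))

    # Rank random noise per sample; the lowest-noise tokens stay visible.
    noise = torch.rand(batch, num_tokens)
    visible_idx = noise.argsort(dim=1)[:, :num_visible]

    mask = torch.zeros(batch, num_tokens)
    mask.scatter_(1, visible_idx, 1.0)            # 1 = visible, 0 = masked
    mask = mask.view(batch, 1, grid, grid)

    # One upsampled copy per stage; each coarse token maps to an s x s block.
    return [F.interpolate(mask, scale_factor=float(s), mode="nearest")
            for s in stage_scales]


# Example: masks for three stages at 56x56, 28x28, and 14x14 resolution.
masks = block_wise_masks(batch=2)
print([m.shape for m in masks])  # [torch.Size([2, 1, 56, 56]), ...]
```
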
[1] Xinggang Wang, et al. Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection, 2022, IEEE International Conference on Computer Vision.

[2] Ross B. Girshick, et al. Exploring Plain Vision Transformer Backbones for Object Detection, 2022, ECCV.

[3] Limin Wang, et al. VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training, 2022, NeurIPS.

[4] Jakob Verbeek, et al. Three things everyone should know about Vision Transformers, 2022, ECCV.

[5] Shijian Lu, et al. Accelerating DETR Convergence via Semantic-Aligned Matching, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Jian Sun, et al. Scaling Up Your Kernels to 31×31: Revisiting Large Kernel Design in CNNs, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Ping Luo, et al. Context Autoencoder for Self-Supervised Representation Learning, 2022, Int. J. Comput. Vis.

[8] Michael Auli, et al. data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language, 2022, ICML.

[9] Hang Su, et al. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR, 2022, ICLR.

[10] Yali Wang, et al. UniFormer: Unifying Convolution and Self-Attention for Visual Recognition, 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11] Trevor Darrell, et al. A ConvNet for the 2020s, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12] A. Yuille, et al. Masked Feature Prediction for Self-Supervised Visual Pre-Training, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13] J. Malik, et al. MViTv2: Improved Multiscale Vision Transformers for Classification and Detection, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Fang Wen, et al. PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers, 2021, AAAI.

[15] Ross B. Girshick, et al. Benchmarking Detection Transfer Learning with Vision Transformers, 2021, ArXiv.

[16] Han Hu, et al. SimMIM: a Simple Framework for Masked Image Modeling, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Tao Kong, et al. iBOT: Image BERT Pre-Training with Online Tokenizer, 2021, ArXiv.

[18] Ross B. Girshick, et al. Masked Autoencoders Are Scalable Vision Learners, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Kai Han, et al. CMT: Convolutional Neural Networks Meet Vision Transformers, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Trevor Darrell, et al. Early Convolutions Help Transformers See Better, 2021, NeurIPS.

[21] Li Dong, et al. BEiT: BERT Pre-Training of Image Transformers, 2021, ICLR.

[22] Quoc V. Le, et al. CoAtNet: Marrying Convolution and Attention for All Data Sizes, 2021, NeurIPS.

[23] Roozbeh Mottaghi, et al. Container: Context Aggregation Network, 2021, NeurIPS.

[24] Julien Mairal, et al. Emerging Properties in Self-Supervised Vision Transformers, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[25] Saining Xie, et al. An Empirical Study of Training Self-Supervised Vision Transformers, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[26] N. Codella, et al. CvT: Introducing Convolutions to Vision Transformers, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[27] Ari S. Morcos, et al. ConViT: improving vision transformers with soft convolutional inductive biases, 2021, ICML.

[28] Enhua Wu, et al. Transformer in Transformer, 2021, NeurIPS.

[29] Alec Radford, et al. Zero-Shot Text-to-Image Generation, 2021, ICML.

[30] Xiang Li, et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[31] Chunhua Shen, et al. Conditional Positional Encodings for Vision Transformers, 2021, ICLR.

[32] Gedas Bertasius, et al. Is Space-Time Attention All You Need for Video Understanding?, 2021, ICML.

[33] Francis E. H. Tay, et al. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[34] Pieter Abbeel, et al. Bottleneck Transformers for Visual Recognition, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Peng Gao, et al. Fast Convergence of DETR with Spatially Modulated Co-Attention, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[36] Matthieu Cord, et al. Training data-efficient image transformers & distillation through attention, 2020, ICML.

[37] Bin Li, et al. Deformable DETR: Deformable Transformers for End-to-End Object Detection, 2020, ICLR.

[38] Pieter Abbeel, et al. Locally Masked Convolution for Autoregressive Models, 2020, UAI.

[39] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[40] Nicolas Usunier, et al. End-to-End Object Detection with Transformers, 2020, ECCV.

[41] Zheng Zhang, et al. Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation, 2020, ECCV.

[42] Geoffrey E. Hinton, et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020, ICML.

[43] Ross B. Girshick, et al. Momentum Contrast for Unsupervised Visual Representation Learning, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Yuning Jiang, et al. Unified Perceptual Parsing for Scene Understanding, 2018, ECCV.

[45] Ting-Chun Wang, et al. Image Inpainting for Irregular Holes Using Partial Convolutions, 2018, ECCV.

[46] Bin Yang, et al. SBNet: Sparse Blocks Network for Fast Inference, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47] Laurens van der Maaten, et al. 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[48] Bolei Zhou, et al. Temporal Relational Reasoning in Videos, 2017, ECCV.

[49] Abhinav Gupta, et al. Non-local Neural Networks, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.

[51] Susanne Westphal, et al. The “Something Something” Video Database for Learning and Evaluating Visual Common Sense, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[52] Laurens van der Maaten, et al. Submanifold Sparse Convolutional Networks, 2017, ArXiv.

[53] Andrew Zisserman, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54] Fabio Viola, et al. The Kinetics Human Action Video Dataset, 2017, ArXiv.

[55] Ross B. Girshick, et al. Mask R-CNN, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[56] Xiangyu Zhang, et al. Large Kernel Matters — Improve Semantic Segmentation by Global Convolutional Network, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57] Serge J. Belongie, et al. Feature Pyramid Networks for Object Detection, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58] Bolei Zhou, et al. Semantic Understanding of Scenes Through the ADE20K Dataset, 2016, International Journal of Computer Vision.

[59] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.

[60] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[61] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[62] B. Triggs, et al. Histograms of oriented gradients for human detection, 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[63] Stephen Lin, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[64] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[65] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.