Masked Feature Prediction for Self-Supervised Visual Pre-Training

We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the features of the masked regions. We study five different types of features and find that Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, work particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame, and obtains competitive results on ImageNet.
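
The mask-then-predict objective described above can be illustrated with a minimal NumPy sketch on a single grayscale frame. This is not the authors' implementation. The cell size, number of orientation bins, masking ratio, and the helper names `hog_cells`, `random_mask`, and `masked_feature_loss` are all assumptions made for illustration. The per-cell L2 normalization is a simplified stand-in for the local contrast normalization mentioned in the abstract, and an all-zero array stands in for the model's prediction.

```python
import numpy as np

def hog_cells(image, cell=8, bins=9, eps=1e-6):
    """Per-cell orientation histograms, L2-normalized per cell (simplified HOG)."""
    gy, gx = np.gradient(image.astype(np.float64))
    mag = np.hypot(gx, gy)
    # Unsigned gradient orientation folded into [0, pi).
    ang = np.mod(np.arctan2(gy, gx), np.pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    h, w = image.shape
    ch, cw = h // cell, w // cell
    hist = np.zeros((ch, cw, bins))
    for b in range(bins):
        # Sum gradient magnitudes falling in bin b within each cell.
        weighted = np.where(bin_idx == b, mag, 0.0)[: ch * cell, : cw * cell]
        hist[:, :, b] = weighted.reshape(ch, cell, cw, cell).sum(axis=(1, 3))
    norm = np.sqrt((hist ** 2).sum(axis=-1, keepdims=True)) + eps
    return hist / norm

def random_mask(shape, ratio, rng):
    """Boolean mask over cells with roughly `ratio` of them set to True."""
    n = shape[0] * shape[1]
    k = max(1, int(round(ratio * n)))
    flat = np.zeros(n, dtype=bool)
    flat[rng.choice(n, size=k, replace=False)] = True
    return flat.reshape(shape)

def masked_feature_loss(pred, target, mask):
    """Mean squared error computed over masked cells only."""
    return ((pred - target) ** 2)[mask].mean()

rng = np.random.default_rng(0)
image = rng.random((32, 32))                     # toy input frame (assumed size)
target = hog_cells(image)                        # regression targets, shape (4, 4, 9)
mask = random_mask(target.shape[:2], 0.4, rng)   # masking ratio is an assumption
pred = np.zeros_like(target)                     # placeholder for a model's output
loss = masked_feature_loss(pred, target, mask)
```

In a real setup, `pred` would come from a network that sees the input with the masked regions replaced, and the loss would be minimized only over those masked regions, as in the sketch.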
