Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Mahmoud Assran | Quentin Duval | Ishan Misra | Piotr Bojanowski | Pascal Vincent | Michael G. Rabbat | Yann LeCun | Nicolas Ballas