The Hidden Uniform Cluster Prior in Self-Supervised Learning

A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that the formulation of all these methods contains an overlooked prior: to learn features that enable uniform clustering of the data. While this prior has led to remarkably semantic representations when pretraining on class-balanced data, such as ImageNet, we demonstrate that it can hamper performance when pretraining on class-imbalanced data. By moving away from conventional uniformity priors and instead preferring power-law distributed feature clusters, we show that one can improve the quality of the learned representations on real-world class-imbalanced datasets. To demonstrate this, we develop an extension of the Masked Siamese Networks (MSN) method to support the use of arbitrary feature priors.
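
To make the role of the prior concrete, the sketch below is an illustration of the general idea, not the paper's released code: clustering-based SSL objectives (SwAV/MSN-style) typically compute soft target assignments whose column marginals over prototypes are uniform; replacing those marginals with a power-law distribution biases the targets toward imbalanced clusters. The function names, the Zipf exponent `alpha`, the temperature `eps`, and the iteration count are illustrative assumptions.

```python
# Illustrative sketch (assumed, not the authors' implementation): Sinkhorn-style
# target assignments with an arbitrary prior over prototypes instead of the
# usual uniform marginal.
import numpy as np

def power_law_prior(num_prototypes: int, alpha: float = 1.0) -> np.ndarray:
    """Prior over prototypes with p(k) proportional to 1 / k^alpha (Zipf-like)."""
    ranks = np.arange(1, num_prototypes + 1, dtype=np.float64)
    prior = ranks ** (-alpha)
    return prior / prior.sum()

def sinkhorn_targets(scores: np.ndarray,
                     prototype_prior: np.ndarray,
                     num_iters: int = 3,
                     eps: float = 0.05) -> np.ndarray:
    """Turn a (batch, prototypes) score matrix into soft targets whose
    column sums follow `prototype_prior`; each row is one sample's target."""
    q = np.exp((scores - scores.max()) / eps)          # positive coupling matrix
    batch_size = q.shape[0]
    row_marginal = np.full(batch_size, 1.0 / batch_size)
    for _ in range(num_iters):
        q *= prototype_prior / q.sum(axis=0)           # match prototype marginals (the prior)
        q *= (row_marginal / q.sum(axis=1))[:, None]   # match per-sample marginals
    return q * batch_size                              # rescale so each row sums to 1

# Usage: a uniform prior recovers the usual uniform-cluster targets;
# a power-law prior yields targets with an imbalanced cluster structure.
scores = np.random.randn(256, 1024)
uniform_targets = sinkhorn_targets(scores, np.full(1024, 1.0 / 1024))
zipf_targets = sinkhorn_targets(scores, power_law_prior(1024, alpha=1.0))
```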
