VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning

Recent self-supervised methods for image representation learning are based on maximizing the agreement between embedding vectors from different views of the same image. A trivial solution is obtained when the encoder outputs constant vectors. This collapse problem is often avoided through implicit biases in the learning architecture, that often lack a clear justification or interpretation. In this paper, we introduce VICReg (Variance-Invariance-Covariance Regularization), a method that explicitly avoids the collapse problem with a simple regularization term on the variance of the embeddings along each dimension individually. VICReg combines the variance term with a decorrelation mechanism based on redundancy reduction and covariance regularization, and achieves results on par with the state of the art on several downstream tasks. In addition, we show that incorporating our new variance term into other methods helps stabilize the training and leads to performance improvements.

[1]  Yann LeCun,et al.  Signature Verification Using A "Siamese" Time Delay Neural Network , 1993, Int. J. Pattern Recognit. Artif. Intell..

[2]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[3]  Kaiming He,et al.  Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.

[4]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Chengxu Zhuang,et al.  Local Aggregation for Unsupervised Learning of Visual Embeddings , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  Yann LeCun,et al.  Learning a similarity metric discriminatively, with application to face verification , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[7]  Nikos Komodakis,et al.  Wide Residual Networks , 2016, BMVC.

[8]  Yuandong Tian,et al.  Understanding self-supervised Learning Dynamics without Contrastive Pairs , 2021, ICML.

[9]  Nicu Sebe,et al.  Whitening for Self-Supervised Representation Learning , 2020, ICML.

[10]  Ali Farhadi,et al.  Unsupervised Deep Embedding for Clustering Analysis , 2015, ICML.

[11]  Marco Cuturi,et al.  Sinkhorn Distances: Lightspeed Computation of Optimal Transport , 2013, NIPS.

[12]  Julien Mairal,et al.  Unsupervised Learning of Visual Features by Contrasting Cluster Assignments , 2020, NeurIPS.

[13]  Shih-Fu Chang,et al.  Unsupervised Embedding Learning via Invariant and Spreading Instance Feature , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[15]  Mohammad Norouzi,et al.  Big Self-Supervised Models are Strong Semi-Supervised Learners , 2020, NeurIPS.

[16]  Abhinav Gupta,et al.  ClusterFit: Improving Generalization of Visual Representations , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Yoshua Bengio,et al.  Learning deep representations by mutual information estimation and maximization , 2018, ICLR.

[18]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[19]  Xinlei Chen,et al.  Exploring Simple Siamese Representation Learning , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[22]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[23]  Cordelia Schmid,et al.  What makes for good views for contrastive learning , 2020, NeurIPS.

[24]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[25]  Shaogang Gong,et al.  Unsupervised Deep Learning by Neighbourhood Discovery , 2019, ICML.

[26]  Laurens van der Maaten,et al.  Self-Supervised Learning of Pretext-Invariant Representations , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Björn Ommer,et al.  CliqueCNN: Deep Unsupervised Exemplar Learning , 2016, NIPS.

[28]  Nicu Sebe,et al.  Whitening and Coloring transform for GANs , 2018, ArXiv.

[29]  Geoffrey E. Hinton,et al.  A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[30]  Michal Valko,et al.  Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning , 2020, NeurIPS.

[31]  Karol J. Piczak ESC: Dataset for Environmental Sound Classification , 2015, ACM Multimedia.

[32]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[33]  Phillip Isola,et al.  Contrastive Multiview Coding , 2019, ECCV.

[34]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[35]  R Devon Hjelm,et al.  Learning Representations by Maximizing Mutual Information Across Views , 2019, NeurIPS.

[36]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[37]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[38]  Stella X. Yu,et al.  Unsupervised Feature Learning via Non-parametric Instance Discrimination , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Georg Heigold,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.

[40]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Julien Mairal,et al.  Unsupervised Pre-Training of Image Features on Non-Curated Data , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[42]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[43]  Matthieu Cord,et al.  Learning Representations by Predicting Bags of Visual Words , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Dhruv Batra,et al.  Joint Unsupervised Learning of Deep Representations and Image Clusters , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Andrea Vedaldi,et al.  Self-labelling via simultaneous clustering and representation learning , 2020, ICLR.

[46]  Junnan Li,et al.  Prototypical Contrastive Learning of Unsupervised Representations , 2020, ICLR.

[47]  Frank Hutter,et al.  SGDR: Stochastic Gradient Descent with Warm Restarts , 2016, ICLR.

[48]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[49]  Kaiming He,et al.  Improved Baselines with Momentum Contrastive Learning , 2020, ArXiv.

[50]  Matthieu Cord,et al.  Online Bag-of-Visual-Words Generation for Unsupervised Representation Learning , 2020, ArXiv.

[51]  Yann LeCun,et al.  Barlow Twins: Self-Supervised Learning via Redundancy Reduction , 2021, ICML.

[52]  Razvan Pascanu,et al.  BYOL works even without batch statistics , 2020, ArXiv.

[53]  Yang Song,et al.  The iNaturalist Species Classification and Detection Dataset , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[54]  Kaiming He,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Yang You,et al.  Large Batch Training of Convolutional Networks , 2017, 1708.03888.

[56]  Ali Razavi,et al.  Data-Efficient Image Recognition with Contrastive Predictive Coding , 2019, ICML.