Demystifying Self-Supervised Learning: An Information-Theoretical Framework

Self-supervised representation learning adopts self-defined signals as supervision and uses the learned representations for downstream tasks; examples include masked language modeling (e.g., BERT) for natural language processing and contrastive visual representation learning (e.g., SimCLR) for computer vision. In this paper, we present a theoretical framework explaining why self-supervised learning is likely to work, under the assumption that only the information shared (e.g., contextual information or content) between the input (e.g., non-masked words or original images) and the self-supervised signals (e.g., masked words or augmented images) contributes to downstream tasks. Under this assumption, we show that representations learned with self-supervision can extract task-relevant information and discard task-irrelevant information. We further connect our theoretical analysis to popular contrastive and predictive (self-supervised) learning objectives. In the experimental section, we provide controlled experiments on two popular tasks: 1) visual representation learning with various self-supervised learning objectives, to empirically support our analysis; and 2) visual-textual representation learning, a more challenging setting in which the input and the self-supervised signal lie in different modalities.
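
To make the two objective families concrete, below is a minimal sketch (our illustration under stated assumptions, not the authors' implementation) of a contrastive InfoNCE-style loss, as used by SimCLR-like methods, and a simple predictive (reconstruction) loss between the input representation and the self-supervised signal. The tensor names, dimensions, and the linear predictor are hypothetical placeholders.

```python
# Minimal sketches (illustrative assumptions, not the paper's code) of the two
# self-supervised objective families discussed in the abstract.
import torch
import torch.nn.functional as F

def contrastive_infonce(z_x, z_s, temperature=0.1):
    """InfoNCE-style contrastive loss: matched (input, self-supervised signal)
    pairs are positives; all other pairs in the batch serve as negatives.
    z_x, z_s: [N, D] representations of the input and of the signal."""
    z_x = F.normalize(z_x, dim=-1)
    z_s = F.normalize(z_s, dim=-1)
    logits = z_x @ z_s.t() / temperature                      # [N, N] similarities
    targets = torch.arange(z_x.size(0), device=z_x.device)    # positives on the diagonal
    return F.cross_entropy(logits, targets)

def predictive_loss(z_x, s, predictor):
    """Predictive objective: regress the self-supervised signal s (e.g., features
    of an augmented view or masked content) from the input representation z_x."""
    return F.mse_loss(predictor(z_x), s)

# Toy usage with random tensors standing in for encoder outputs.
z_x, z_s = torch.randn(8, 128), torch.randn(8, 128)
predictor = torch.nn.Linear(128, 128)
print(contrastive_infonce(z_x, z_s).item(), predictive_loss(z_x, z_s, predictor).item())
```

Both losses operate only on the (input, self-supervised signal) pair, which is why, under the shared-information assumption above, minimizing them can retain task-relevant information without access to downstream labels.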
