Paper Information - Self-supervised Learning from a Multi-view Perspective

Self-supervised Learning from a Multi-view Perspective

As a subset of unsupervised representation learning, self-supervised representation learning adopts self-defined signals as supervision and uses the learned representation for downstream tasks, such as object detection and image captioning. Many proposed approaches to self-supervised learning naturally follow a multi-view perspective, where the input (e.g., original images) and the self-supervised signals (e.g., augmented images) can be seen as two redundant views of the data. Building on this multi-view perspective, this paper provides an information-theoretic framework to better understand the properties that encourage successful self-supervised learning. Specifically, we demonstrate that self-supervised learned representations can extract task-relevant information and discard task-irrelevant information. Our theoretical framework opens up a larger design space for self-supervised learning objectives. In particular, we propose a composite objective that bridges the gap between prior contrastive and predictive learning objectives, and introduce an additional objective term to discard task-irrelevant information. To verify our analysis, we conduct controlled experiments to evaluate the impact of the composite objectives. We also explore our framework's empirical generalization beyond the multi-view perspective, where the cross-view redundancy may not be clearly observed.

Yao-Hung Hubert Tsai | Louis-Philippe Morency | Ruslan Salakhutdinov | Yue Wu
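
The composite objective mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the loss formulas, so the three terms below (an InfoNCE-style contrastive term, a mean-squared-error predictive term, and a squared-distance term between the two views' representations standing in for the "discard task-irrelevant information" term) are assumptions chosen for illustration. The function names and the weights `w_con`, `w_pred`, and `w_drop` are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(z_x, z_s, temperature=0.5):
    """InfoNCE-style contrastive loss (assumed form).

    Row i of z_x (representation of the input) is the positive pair of
    row i of z_s (representation of the self-supervised signal); all
    other rows of z_s in the batch act as negatives.
    """
    n = len(z_x)
    total = 0.0
    for i in range(n):
        logits = [dot(z_x[i], z_s[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]
    return total / n


def mse(a, b):
    """Mean squared error between two batches of equal shape."""
    count = sum(len(row) for row in a)
    return sum(
        (x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)
    ) / count


def composite_loss(z_x, z_s, s_pred, s_true,
                   w_con=1.0, w_pred=1.0, w_drop=0.1):
    """Weighted sum of the three illustrative terms (weights are hypothetical)."""
    contrastive = info_nce(z_x, z_s)   # pull matching views together
    predictive = mse(s_pred, s_true)   # predict the signal from the representation
    drop = mse(z_x, z_s)               # penalize view-specific differences
    return w_con * contrastive + w_pred * predictive + w_drop * drop


if __name__ == "__main__":
    z = [[1.0, 0.0], [0.0, 1.0]]
    z_shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned views: ", composite_loss(z, z, z, z))
    print("shuffled views:", composite_loss(z, z_shuffled, z, z))
```

In this toy setting the aligned pair of views gives a lower loss than the shuffled pair, which is the behavior a contrastive term is meant to encourage. In practice the representations would come from a trained encoder and the weights would be tuned per task.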
