What Makes for Good Views for Contrastive Learning?

Contrastive learning between multiple views of the data has recently achieved state-of-the-art performance in the field of self-supervised representation learning. Despite its success, the influence of different view choices has been less studied. In this paper, we use empirical analysis to better understand the importance of view selection, and argue that we should reduce the mutual information (MI) between views while keeping task-relevant information intact. To verify this hypothesis, we devise unsupervised and semi-supervised frameworks that learn effective views by aiming to reduce their MI. We also consider data augmentation as a way to reduce MI, and show that increasing data augmentation indeed leads to decreasing MI and improved downstream classification accuracy. As a by-product, we also achieve a new state-of-the-art accuracy on unsupervised pre-training for ImageNet classification ($73\%$ top-1 linear readout with a ResNet-50). In addition, transferring our models to PASCAL VOC object detection and COCO instance segmentation consistently outperforms supervised pre-training. Code: this http URL
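To make the view/MI discussion concrete, the sketch below shows a standard InfoNCE-style contrastive objective between two augmented views of the same images, which is the kind of loss the paper builds on. The encoder, augmentation policy, batch size, and temperature here are illustrative assumptions, not the authors' exact setup; the point is only that stronger augmentation tends to reduce the information shared between the two views while (ideally) preserving task-relevant content.

```python
# Minimal sketch (assumed setup, not the paper's exact pipeline): InfoNCE loss
# between two views of the same batch of images.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings of two views of the same N images."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                     # (N, N) pairwise similarities
    labels = torch.arange(z1.size(0), device=z1.device)    # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Hypothetical usage: `encoder` is any backbone (e.g. a ResNet-50) and
# `augment` is a stochastic view-generating augmentation policy.
# x = next(iter(loader))
# loss = info_nce(encoder(augment(x)), encoder(augment(x)))
```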
