Why Do Self-Supervised Models Transfer? Investigating the Impact of Invariance on Downstream Tasks

Self-supervised learning is a powerful paradigm for representation learning on unlabelled images. A wealth of effective new methods based on instance matching rely on data augmentation to drive learning, and these have reached a rough agreement on an augmentation scheme that optimises popular recognition benchmarks. However, there is strong reason to suspect that different tasks in computer vision require features to encode different (in)variances, and therefore likely require different augmentation strategies. In this paper, we measure the invariances learned by contrastive methods and confirm that they do learn invariance to the augmentations used. We further show that this invariance largely transfers to related real-world changes in pose and lighting. We show that learned invariances strongly affect downstream task performance and confirm that different downstream tasks benefit from polar opposite (in)variances, so that the standard augmentation strategy leads to performance loss on some of them. Finally, we demonstrate that a simple fusion of representations with complementary invariances ensures wide transferability to all the diverse downstream tasks considered.
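As a rough illustration of the two ideas above (measuring invariance and fusing representations), the following is a minimal sketch and not the paper's implementation. It assumes invariance is scored as the mean cosine similarity between features of original and transformed images, and that fusion is a plain concatenation of L2-normalised features. The function names `invariance_score` and `fuse` are hypothetical.

```python
import math


def l2_normalise(v):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def invariance_score(feats_orig, feats_aug):
    """Mean cosine similarity between paired feature vectors.

    feats_orig[i] and feats_aug[i] are assumed to be features of the same
    image before and after a transformation. A score near 1 suggests the
    encoder is largely invariant to that transformation.
    """
    if not feats_orig or len(feats_orig) != len(feats_aug):
        raise ValueError("need two non-empty lists of equal length")
    sims = []
    for f, g in zip(feats_orig, feats_aug):
        a, b = l2_normalise(f), l2_normalise(g)
        sims.append(sum(x * y for x, y in zip(a, b)))
    return sum(sims) / len(sims)


def fuse(*feature_vectors):
    """Concatenate L2-normalised features of one image from several encoders.

    Assumption: concatenation is one simple way to combine encoders with
    complementary invariances; other fusion schemes are possible.
    """
    fused = []
    for v in feature_vectors:
        fused.extend(l2_normalise(v))
    return fused
```

In use, the feature lists would come from a pretrained encoder applied to original and augmented images, and the fused vector would feed a downstream linear classifier or regressor.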
