Understanding Self-supervised Learning with Dual Deep Networks

We propose a novel theoretical framework to understand self-supervised learning methods that employ dual pairs of deep ReLU networks (e.g., SimCLR, BYOL). First, we prove that in each SGD update of SimCLR, the weights at each layer are updated by a covariance operator that specifically amplifies initial random selectivities that vary across data samples but survive averaging over data augmentations. We show that this leads to the emergence of hierarchical features if the input data are generated from a hierarchical latent tree model. With the same framework, we also show analytically that in BYOL, the use of BatchNorm and a predictor creates an implicit contrastive term that acts as an approximate covariance operator. This term is formed by the interplay between the zero-mean operation of BatchNorm and the extra predictor in the online network. Extensive ablation studies support our theoretical findings.
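
As a rough illustration of the quantity such an operator is sensitive to, the toy sketch below scores a feature by the variance, across data samples, of its response averaged over augmentations. This is not the paper's construction: the (content, nuisance) data layout, the `augment` function, and the helper `augmentation_averaged_variance` are all hypothetical names chosen for the example, and the "features" are single ReLU units rather than layers of a deep network.

```python
import random
from statistics import fmean, pvariance


def relu(v):
    return max(0.0, v)


def augment(x, rng):
    # Hypothetical augmentation: keep the content, resample the nuisance.
    content, _ = x
    return (content, rng.gauss(0.0, 1.0))


def augmentation_averaged_variance(feature, samples, n_aug=200, seed=0):
    """Variance across samples of the feature's mean response over augmentations."""
    rng = random.Random(seed)
    means = []
    for x in samples:
        responses = [feature(augment(x, rng)) for _ in range(n_aug)]
        means.append(fmean(responses))
    return pvariance(means)


# Toy samples: (content, nuisance) pairs that differ only in content.
samples = [(c, 0.0) for c in (-2.0, -1.0, 0.0, 1.0, 2.0)]

# A unit selective to content: varies across samples, unchanged by augmentation.
var_content = augmentation_averaged_variance(lambda x: relu(x[0]), samples)
# A unit selective to nuisance: varies under augmentation, averages out per sample.
var_nuisance = augmentation_averaged_variance(lambda x: relu(x[1]), samples)

print(var_content, var_nuisance)
```

In this toy setting the content-selective unit receives a much larger score than the nuisance-selective one, which matches the kind of selectivity the abstract describes as being amplified: variation across samples that survives averaging over augmentations.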
