Self-Adaptive Training: Bridging Supervised and Self-Supervised Learning

We propose self-adaptive training, a unified training algorithm that dynamically calibrates and enhances the training process using model predictions, without incurring extra computational cost, to advance both supervised and self-supervised learning of deep neural networks. We analyze the training dynamics of deep networks on training data corrupted by, e.g., random noise and adversarial examples. Our analysis shows that model predictions can amplify useful underlying information in the data, and that this phenomenon occurs broadly even in the absence of any label information. This highlights that model predictions can substantially benefit training: self-adaptive training improves the generalization of deep networks under noise and enhances self-supervised representation learning. The analysis also sheds light on understanding deep learning, e.g., offering a potential explanation of the recently discovered double-descent phenomenon in empirical risk minimization and of the representation-collapse issue in state-of-the-art self-supervised learning algorithms. Experiments on the CIFAR, STL, and ImageNet datasets verify the effectiveness of our approach in three applications: classification with label noise, selective classification, and linear evaluation. To facilitate future research, the code has been made publicly available at https://github.com/LayneH/self-adaptive-training.
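
To make the calibration step concrete, below is a minimal PyTorch sketch of the supervised variant, assuming the formulation of the companion NeurIPS 2020 paper "Self-Adaptive Training: beyond Empirical Risk Minimization": per-sample soft targets accumulate an exponential moving average (EMA) of the model's own predictions and re-weight a cross-entropy-style loss. The class name, hyperparameter values, and warm-up schedule here are illustrative, not taken from the released code.

```python
# A minimal sketch of supervised self-adaptive training, assuming the
# EMA-of-predictions formulation. Assumes labels, logits, and stored
# targets live on the same device.

import torch
import torch.nn.functional as F


class SelfAdaptiveLoss:
    """Keeps an EMA of model predictions per training example as soft targets."""

    def __init__(self, labels: torch.Tensor, num_classes: int,
                 momentum: float = 0.9, warmup_epochs: int = 60):
        # Initialize targets with the (possibly noisy) one-hot labels.
        self.targets = F.one_hot(labels, num_classes).float()
        self.momentum = momentum
        self.warmup_epochs = warmup_epochs

    def __call__(self, logits: torch.Tensor, indices: torch.Tensor,
                 epoch: int) -> torch.Tensor:
        if epoch < self.warmup_epochs:
            # Plain cross-entropy against the initial labels during warm-up.
            return F.cross_entropy(logits, self.targets[indices].argmax(dim=1))

        # EMA update of the per-sample soft targets using current predictions.
        with torch.no_grad():
            probs = F.softmax(logits, dim=1)
            self.targets[indices] = (self.momentum * self.targets[indices]
                                     + (1.0 - self.momentum) * probs)

        targets = self.targets[indices]
        # Re-weight examples by the confidence of their soft target, so that
        # samples the model consistently disagrees with contribute less.
        weights = targets.max(dim=1).values
        weights = weights / weights.sum()
        loss = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1)
        return (weights * loss).sum()
```

The EMA accumulation smooths out noisy individual predictions, and the warm-up phase avoids bootstrapping from the near-random predictions of an untrained network; both choices reflect the paper's observation that useful information emerges in model predictions only after some training, before memorization of corrupted labels sets in.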
