Self-Adaptive Training: beyond Empirical Risk Minimization

We propose self-adaptive training, a new training algorithm that dynamically corrects problematic training labels using model predictions without incurring extra computational cost, to improve the generalization of deep learning on potentially corrupted training data. This problem is crucial for learning robustly from data corrupted by, e.g., label noise and out-of-distribution samples. Standard empirical risk minimization (ERM) on such data, however, may easily overfit the noise and thus suffer from sub-optimal performance. In this paper, we observe that model predictions can substantially benefit the training process: self-adaptive training significantly improves generalization over ERM under various levels of noise, and mitigates the overfitting issue in both natural and adversarial training. We evaluate the error-capacity curve of self-adaptive training: the test error decreases monotonically with model capacity. This is in sharp contrast to the recently discovered double-descent phenomenon in ERM, which might be a result of overfitting noise. Experiments on the CIFAR and ImageNet datasets verify the effectiveness of our approach in two applications: classification with label noise and selective classification. We release our code at this https URL.
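
The abstract only names the mechanism, so the sketch below illustrates one plausible form of the label-correction step: per-sample soft targets are updated as an exponential moving average of the stored targets and the model's current predictions, and the loss re-weights samples by the confidence of their corrected targets. The hyper-parameters (`alpha`, `start_epoch`) and the exact weighting are assumptions made for this sketch; the released code is the authoritative reference.

```python
# Minimal PyTorch sketch of the label-correction idea described above.
# `alpha`, `start_epoch`, and the confidence-based re-weighting are
# assumptions for illustration; see the authors' released code for the
# exact training recipe.
import torch
import torch.nn.functional as F


class SelfAdaptiveLoss(torch.nn.Module):
    def __init__(self, labels, num_classes, alpha=0.9, start_epoch=60):
        super().__init__()
        # One soft target per training sample, initialized to the (possibly
        # noisy) one-hot labels.
        self.register_buffer("targets", F.one_hot(labels, num_classes).float())
        self.alpha = alpha
        self.start_epoch = start_epoch

    def forward(self, logits, indices, epoch):
        probs = F.softmax(logits, dim=1)
        if epoch >= self.start_epoch:
            # Exponential moving average of stored targets and current
            # predictions; reuses the forward pass already computed for the
            # loss, so no extra computational cost is incurred.
            self.targets[indices] = (
                self.alpha * self.targets[indices]
                + (1 - self.alpha) * probs.detach()
            )
        targets = self.targets[indices]
        # Weight each sample by the confidence of its corrected target.
        weights = targets.max(dim=1).values
        per_sample_loss = -(targets * torch.log(probs + 1e-12)).sum(dim=1)
        return (weights * per_sample_loss).sum() / weights.sum()
```

In this sketch, the data loader is assumed to return the sample index alongside each image (e.g., `loss = criterion(model(x), idx, epoch)`), so that the stored soft target for each sample can be looked up and updated in place across epochs.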
