Self-Adaptive Training: beyond Empirical Risk Minimization

We propose self-adaptive training, a new training algorithm that dynamically corrects problematic training labels using model predictions without incurring extra computational cost, to improve the generalization of deep learning on potentially corrupted training data. This problem is crucial for learning robustly from data corrupted by, e.g., label noise and out-of-distribution samples. Standard empirical risk minimization (ERM) on such data, however, may easily overfit the noise and thus suffer from sub-optimal performance. In this paper, we observe that model predictions can substantially benefit the training process: self-adaptive training significantly improves generalization over ERM under various levels of noise, and mitigates the overfitting issue in both natural and adversarial training. We evaluate the error-capacity curve of self-adaptive training: the test error decreases monotonically with model capacity. This is in sharp contrast to the recently discovered double-descent phenomenon in ERM, which might be a result of overfitting noise. Experiments on the CIFAR and ImageNet datasets verify the effectiveness of our approach in two applications: classification with label noise and selective classification. We release our code at this https URL.
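
The abstract only names the mechanism, so the sketch below illustrates one plausible form of the label-correction step: per-sample soft targets are updated as an exponential moving average of the stored targets and the model's current predictions, and the loss re-weights samples by the confidence of their corrected targets. The hyper-parameters (`alpha`, `start_epoch`) and the exact weighting are assumptions made for this sketch; the released code is the authoritative reference.

```python
# Minimal PyTorch sketch of the label-correction idea described above.
# `alpha`, `start_epoch`, and the confidence-based re-weighting are
# assumptions for illustration; see the authors' released code for the
# exact training recipe.
import torch
import torch.nn.functional as F


class SelfAdaptiveLoss(torch.nn.Module):
    def __init__(self, labels, num_classes, alpha=0.9, start_epoch=60):
        super().__init__()
        # One soft target per training sample, initialized to the (possibly
        # noisy) one-hot labels.
        self.register_buffer("targets", F.one_hot(labels, num_classes).float())
        self.alpha = alpha
        self.start_epoch = start_epoch

    def forward(self, logits, indices, epoch):
        probs = F.softmax(logits, dim=1)
        if epoch >= self.start_epoch:
            # Exponential moving average of stored targets and current
            # predictions; reuses the forward pass already computed for the
            # loss, so no extra computational cost is incurred.
            self.targets[indices] = (
                self.alpha * self.targets[indices]
                + (1 - self.alpha) * probs.detach()
            )
        targets = self.targets[indices]
        # Weight each sample by the confidence of its corrected target.
        weights = targets.max(dim=1).values
        per_sample_loss = -(targets * torch.log(probs + 1e-12)).sum(dim=1)
        return (weights * per_sample_loss).sum() / weights.sum()
```

In this sketch, the data loader is assumed to return the sample index alongside each image (e.g., `loss = criterion(model(x), idx, epoch)`), so that the stored soft target for each sample can be looked up and updated in place across epochs.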
