Distillation ≈ Early Stopping? Harvesting Dark Knowledge Utilizing Anisotropic Information Retrieval For Overparameterized Neural Network

Distillation is a method to transfer knowledge from one model to another and often achieves higher accuracy at the same capacity. In this paper, we aim to provide a theoretical understanding of what mainly drives the success of distillation. Our answer is "early stopping". Assuming that the teacher network is overparameterized, we argue that the teacher network essentially harvests dark knowledge from the data via early stopping. This can be justified by a new concept, Anisotropic Information Retrieval (AIR), which means that the neural network tends to fit the informative part of the data first and the non-informative part (including noise) later. Motivated by recent developments in the theoretical analysis of overparameterized neural networks, we characterize AIR through the eigenspace of the Neural Tangent Kernel (NTK). AIR facilitates a new understanding of distillation. With that, we further utilize distillation to refine noisy labels. We propose a self-distillation algorithm that sequentially distills knowledge from the network at the previous training epoch to avoid memorizing the wrong labels. We also demonstrate, both theoretically and empirically, that self-distillation benefits from more than just early stopping. Theoretically, we prove convergence of the proposed algorithm to the ground-truth labels for randomly initialized overparameterized neural networks in terms of the $\ell_2$ distance, whereas previous results only established convergence in $0$-$1$ loss. This result ensures that the learned neural network enjoys a margin on the training data, which leads to better generalization. Empirically, we achieve better test accuracy and entirely avoid early stopping, which makes the algorithm more user-friendly.
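
The abstract describes the self-distillation procedure only in words. The following PyTorch-style sketch shows one way the idea could be realized, assuming the targets at epoch $t$ are a convex combination of the (possibly noisy) one-hot labels and the soft predictions of a frozen snapshot of the model from epoch $t-1$. The function name `self_distill`, the mixing weight `alpha`, and the data `loader` are illustrative assumptions, not the paper's own implementation.

```python
import copy
import torch
import torch.nn.functional as F

def self_distill(model, loader, epochs, lr=0.1, alpha=0.5, device="cpu"):
    """Sketch of sequential self-distillation: at each epoch, a frozen snapshot
    of the model from the previous epoch supplies soft targets, which are mixed
    with the (possibly noisy) one-hot labels via the hypothetical weight alpha."""
    model = model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    teacher = None  # first epoch: plain supervised training, no teacher yet
    for _ in range(epochs):
        model.train()
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            target = F.one_hot(y, logits.size(1)).float()
            if teacher is not None:
                with torch.no_grad():
                    soft = F.softmax(teacher(x), dim=1)
                # mix previous-epoch soft predictions with the given labels
                target = alpha * soft + (1 - alpha) * target
            # cross-entropy against the (soft) mixed targets
            loss = -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        # the snapshot of this epoch becomes the next epoch's teacher
        teacher = copy.deepcopy(model).eval()
    return model
```

Because the teacher is refreshed every epoch rather than fully trained in advance, the soft targets track what the network has fit so far; by AIR, early epochs fit mostly informative directions, so the mixed targets are pulled away from the noisy labels before the network can memorize them.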
