DURING DEEP NEURAL NETWORK LEARNING

Inspired by the phenomenon of catastrophic forgetting, we investigate the learning dynamics of neural networks as they train on single classification tasks. Our goal is to understand whether a related phenomenon occurs when data does not undergo a clear distributional shift. We define a “forgetting event” to have occurred when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Across several benchmark data sets, we find that: (i) certain examples are forgotten with high frequency, and some not at all; (ii) a data set’s (un)forgettable examples generalize across neural architectures; and (iii) based on forgetting dynamics, a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.

[1]  Michael McCloskey,et al.  Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem , 1989 .

[2]  George H. John Robust Decision Trees: Removing Outliers from Databases , 1995, KDD.

[3]  Carla E. Brodley,et al.  Identifying Mislabeled Training Data , 1999, J. Artif. Intell. Res..

[4]  Yann LeCun,et al.  The mnist database of handwritten digits , 2005 .

[5]  Yoshua Bengio,et al.  Scaling learning algorithms towards AI , 2007 .

[6]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[7]  Jason Weston,et al.  Curriculum learning , 2009, ICML '09.

[8]  Yong Jae Lee,et al.  Learning the easy things first: Self-paced visual category discovery , 2011, CVPR 2011.

[9]  Joan Bruna,et al.  Training Convolutional Networks with Noisy Labels , 2014, ICLR 2014.

[10]  Ryota Tomioka,et al.  In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning , 2014, ICLR.

[11]  Tong Zhang,et al.  Stochastic Optimization with Importance Sampling for Regularized Loss Minimization , 2014, ICML.

[12]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[13]  Nikos Komodakis,et al.  Wide Residual Networks , 2016, BMVC.

[14]  Tom Schaul,et al.  Prioritized Experience Replay , 2015, ICLR.

[15]  Percy Liang,et al.  Understanding Black-box Predictions via Influence Functions , 2017, ICML.

[16]  Tao Qin,et al.  Learning What Data to Learn , 2017, ArXiv.

[17]  Hugo Larochelle,et al.  Optimization as a Model for Few-Shot Learning , 2016, ICLR.

[18]  Razvan Pascanu,et al.  Overcoming catastrophic forgetting in neural networks , 2016, Proceedings of the National Academy of Sciences.

[19]  Samy Bengio,et al.  Understanding deep learning requires rethinking generalization , 2016, ICLR.

[20]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Jorge Nocedal,et al.  On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima , 2016, ICLR.

[22]  Andrew McCallum,et al.  Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples , 2017, NIPS.

[23]  Stefano Soatto,et al.  Entropy-SGD: biasing gradient descent into wide valleys , 2016, ICLR.

[24]  Sergey Levine,et al.  Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks , 2017, ICML.

[25]  Graham W. Taylor,et al.  Improved Regularization of Convolutional Neural Networks with Cutout , 2017, ArXiv.

[26]  Nathan Srebro,et al.  The Implicit Bias of Gradient Descent on Separable Data , 2017, J. Mach. Learn. Res..

[27]  Huan Wang,et al.  Identifying Generalization Properties in Neural Networks , 2018, ArXiv.

[28]  Yoshua Bengio,et al.  On the Learning Dynamics of Deep Neural Networks , 2018, ArXiv.

[29]  Li Fei-Fei,et al.  MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels , 2017, ICML.

[30]  Jonghyun Choi,et al.  ScreenerNet: Learning Curriculum for Neural Networks , 2018, ArXiv.

[31]  François Fleuret,et al.  Not All Samples Are Created Equal: Deep Learning with Importance Sampling , 2018, ICML.

[32]  Yuanzhi Li,et al.  An Alternative View: When Does SGD Escape Local Minima? , 2018, ICML.

[33]  Yi Zhou,et al.  Convergence of SGD in Learning ReLU Models with Separable Data , 2018, ArXiv.

[34]  David Barber,et al.  Online Structured Laplace Approximations For Overcoming Catastrophic Forgetting , 2018, NeurIPS.

[35]  Jason Yosinski,et al.  Measuring the Intrinsic Dimension of Objective Landscapes , 2018, ICLR.

[36]  Chico Q. Camargo,et al.  Deep learning generalizes because the parameter-function map is biased towards simple functions , 2018, ICLR.

[37]  Andrew M. Saxe,et al.  High-dimensional dynamics of generalization error in neural networks , 2017, Neural Networks.