Deep Learning on a Data Diet: Finding Important Examples Early in Training

The recent success of deep learning has been driven in part by training increasingly overparameterized networks on ever larger datasets. It is therefore natural to ask: how much of the data is superfluous, which examples are important for generalization, and how do we find them? In this work, we make the striking observation that, on standard vision benchmarks, the initial loss gradient norm of individual training examples, averaged over several weight initializations, can be used to identify a smaller set of training data that is important for generalization. Furthermore, after only a few epochs of training, the information in gradient norms is reflected in the normed error, the L2 distance between the predicted class probabilities and the one-hot label, which can be used to prune a significant fraction of the dataset without sacrificing test accuracy. Based on this, we propose data pruning methods that use only local information early in training, and connect them to recent work that prunes data by discarding examples that are rarely forgotten over the course of training. Our methods also shed light on how the underlying data distribution shapes the training dynamics: they rank examples by their importance for generalization, detect noisy examples, and identify subspaces of the model's data representation that are relatively stable over training.
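To make the error-based score concrete, here is a minimal sketch of computing the L2 distance between predicted class probabilities and one-hot labels (the paper's "EL2N"-style score) and keeping only the highest-scoring examples. The function names, the keep fraction, and the random stand-in data are illustrative assumptions; in practice the probabilities would come from a network trained for a few epochs, with scores averaged over several runs or initializations.

```python
import numpy as np

def el2n_scores(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-example error L2-norm score.

    probs:  (n_examples, n_classes) softmax outputs from a briefly trained model
            (stand-in inputs for this sketch).
    labels: (n_examples,) integer class labels.
    """
    one_hot = np.eye(probs.shape[1])[labels]          # (n, c) one-hot targets
    return np.linalg.norm(probs - one_hot, axis=1)    # L2 distance per example

def prune_dataset(scores: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Return indices of the highest-scoring (hardest) examples to keep."""
    n_keep = int(len(scores) * keep_fraction)
    return np.argsort(scores)[-n_keep:]

# Usage with random stand-in data; real scores would be computed from a
# partially trained network and averaged over several initializations.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=1000)         # fake softmax outputs
labels = rng.integers(0, 10, size=1000)
scores = el2n_scores(probs, labels)
keep_idx = prune_dataset(scores, keep_fraction=0.5)
```

Keeping the highest-scoring examples reflects the abstract's claim that hard, high-error examples carry most of the information needed for generalization, while low-scoring examples can be pruned with little loss in test accuracy.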
