Beyond neural scaling laws: beating power law scaling via data pruning

Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone come at considerable cost in compute and energy. Here we focus on the scaling of error with dataset size and show how, in theory, we can break beyond power-law scaling and potentially even reduce it to exponential scaling, provided we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this improved scaling prediction empirically and indeed observe better-than-power-law scaling in practice for ResNets trained on CIFAR-10, SVHN, and ImageNet. Next, given the importance of finding high-quality pruning metrics, we perform the first large-scale benchmarking study of ten different data pruning metrics on ImageNet. We find that most existing high-performing metrics scale poorly to ImageNet, while the best are computationally intensive and require labels for every image. We therefore develop a new simple, cheap, and scalable self-supervised pruning metric that performs comparably to the best supervised metrics. Overall, our work suggests that the discovery of good data pruning metrics may provide a viable path to substantially improved neural scaling laws, thereby reducing the resource costs of modern deep learning.
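To make the scaling contrast concrete, it can be written as follows. This is an illustrative sketch only: the exponents alpha and beta are placeholders, not values or functional forms taken from the paper.

```latex
% Illustrative contrast only; alpha and beta are placeholder constants.
% Conventional power-law scaling of test error with training set size n:
\epsilon(n) \propto n^{-\alpha}, \qquad \alpha > 0
% Regime suggested by an ideal pruning metric: error falling off
% exponentially in the pruned dataset size n_{\mathrm{pruned}}:
\epsilon(n_{\mathrm{pruned}}) \propto e^{-\beta\, n_{\mathrm{pruned}}}, \qquad \beta > 0
```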
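The abstract does not spell out the self-supervised metric, so the following is only a minimal sketch of one plausible prototype-distance instantiation: embed each image with a pretrained self-supervised encoder, cluster the embeddings with k-means, and score each example by its distance to its assigned centroid. The embedding source, cluster count, and the keep-hard rule are assumptions of this sketch, not the paper's exact recipe.

```python
# Sketch of a self-supervised prototype-distance pruning metric.
# Assumes `embeddings` come from a pretrained self-supervised encoder;
# n_clusters and the keep-hard rule are illustrative choices.
import numpy as np
from sklearn.cluster import KMeans

def prune_by_prototype_distance(embeddings: np.ndarray,
                                keep_fraction: float,
                                n_clusters: int = 100,
                                keep_hard: bool = True) -> np.ndarray:
    """Return indices of the examples to keep.

    embeddings:    (n_examples, dim) array of self-supervised features.
    keep_fraction: fraction of the dataset to retain.
    keep_hard:     keep far-from-centroid ("hard") examples; with very
                   small budgets, keeping easy examples may work better.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    # Distance to the assigned centroid serves as a difficulty score.
    dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
    order = np.argsort(dists)      # easy (small distance) -> hard
    if keep_hard:
        order = order[::-1]        # hard examples first
    n_keep = int(keep_fraction * len(embeddings))
    return order[:n_keep]
```

Because the score induces a full ranking, a single pass yields the pruning order for any target dataset size: to realize a given budget, simply keep the top-ranked examples.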
