Prioritized training on points that are learnable, worth learning, and not yet learned

We introduce Goldilocks Selection, a technique for faster model training that selects a sequence of training points that are “just right”. We propose an information-theoretic acquisition function, the reducible validation loss, and compute it with a small proxy model, GoldiProx, to efficiently choose training points that maximize information about the labels of a validation set. We show that the “hard” (e.g. high-loss) points usually selected in the optimization literature are typically noisy, while the “easy” (e.g. low-noise) points often prioritized for curriculum learning convey less information. Further, points with uncertain labels, typically targeted by active learning, tend to be less relevant to the task. In contrast, GoldiProx Selection chooses points that are “just right” and empirically outperforms the above approaches. Moreover, the selected sequence transfers to other architectures, so practitioners can share and reuse it without recreating it.
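
The abstract does not spell out how the reducible validation loss is computed, but a natural reading is that a point's reducible loss is the proxy model's current loss on that point minus the loss assigned by a second proxy trained on the validation set (an estimate of the point's irreducible, noise-driven loss). The PyTorch-style sketch below illustrates that reading only; the function and variable names (`reducible_loss_selection`, `proxy_irreducible`, the top-`k` selection size) are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def reducible_loss_selection(proxy_model, proxy_irreducible, batch_x, batch_y, k):
    """Select the k points in a candidate batch with the highest reducible loss.

    Hypothetical sketch: `proxy_model` is a small proxy (GoldiProx) trained on
    the training stream, and `proxy_irreducible` is a proxy trained on the
    validation set, used to estimate each point's irreducible loss.
    """
    with torch.no_grad():
        # Per-example loss under the current proxy model.
        train_loss = F.cross_entropy(proxy_model(batch_x), batch_y, reduction="none")
        # Per-example loss under the validation-trained proxy
        # (an estimate of label noise / irreducible loss).
        irreducible = F.cross_entropy(proxy_irreducible(batch_x), batch_y, reduction="none")
    # Reducible loss is high for points that are learnable (low irreducible loss)
    # but not yet learned (high current loss); noisy points score low.
    reducible = train_loss - irreducible
    top_idx = torch.topk(reducible, k).indices
    return batch_x[top_idx], batch_y[top_idx]
```

Under this reading, the large target model would then be trained only on the selected subset of each candidate batch, and the resulting sequence of selected indices could be stored and reused for other architectures.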
