Distilling Double Descent

Distillation is the technique of training a “student” model on examples that are labeled by a separate “teacher” model, which itself is trained on a labeled dataset. The most common explanations for why distillation “works” are predicated on the assumption that the student is provided with soft labels, e.g., class probabilities or confidences, from the teacher model. In this work, we show that, even when the teacher model is highly overparameterized and provides only hard labels, using a very large held-out unlabeled dataset to train the student model can result in a model that outperforms more “traditional” approaches. Our explanation for this phenomenon is based on recent work on “double descent”: it has been observed that, once a model’s complexity roughly exceeds the amount required to memorize the training data, increasing the complexity further can, counterintuitively, result in better generalization. Researchers have identified several settings in which this phenomenon occurs, and have made various attempts to explain it (thus far, with only partial success). In contrast, we avoid these questions and instead seek to exploit the phenomenon, demonstrating that a highly overparameterized teacher can avoid overfitting via double descent, while a student trained on a larger independent dataset labeled by this teacher will avoid overfitting due to the size of its training set.
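The recipe described above can be summarized in a few lines. The following is a minimal sketch, not the paper’s experimental setup: the synthetic dataset, split sizes, and network widths are hypothetical, and a scikit-learn MLP stands in for whatever architectures are actually used. The teacher is made wide enough to interpolate its small labeled training set, it then assigns hard labels (predicted classes only, no probabilities) to a much larger unlabeled pool, and the student trains on that pseudo-labeled pool.

```python
# Sketch of hard-label distillation with an overparameterized teacher.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Hypothetical data and split sizes, for illustration only.
X, y = make_classification(n_samples=60_000, n_features=20, n_informative=10,
                           n_classes=2, random_state=0)
X_labeled, y_labeled = X[:2_000], y[:2_000]      # small labeled training set
X_unlabeled = X[2_000:50_000]                    # large held-out unlabeled pool
X_test, y_test = X[50_000:], y[50_000:]

# Teacher: wide enough to memorize (interpolate) the labeled set, i.e. past the
# interpolation threshold where double descent can set in.
teacher = MLPClassifier(hidden_layer_sizes=(4096,), max_iter=500, random_state=0)
teacher.fit(X_labeled, y_labeled)

# Hard labels only: predicted classes, no soft probabilities or confidences.
pseudo_labels = teacher.predict(X_unlabeled)

# Student: trained purely on the teacher-labeled pool; the size of this
# (pseudo-)training set is what protects it from overfitting.
student = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200, random_state=0)
student.fit(X_unlabeled, pseudo_labels)

print("teacher test accuracy:", teacher.score(X_test, y_test))
print("student test accuracy:", student.score(X_test, y_test))
```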
