Knowledge Distillation as Semiparametric Inference

A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model. Surprisingly, this two-step knowledge distillation process often leads to higher accuracy than training the student directly on labeled data. To explain and enhance this phenomenon, we cast knowledge distillation as a semiparametric inference problem with the optimal student model as the target, the unknown Bayes class probabilities as nuisance, and the teacher probabilities as a plug-in nuisance estimate. By adapting modern semiparametric tools, we derive new guarantees for the prediction error of standard distillation and develop two enhancements—cross-fitting and loss correction—to mitigate the impact of teacher overfitting and underfitting on student performance. We validate our findings empirically on both tabular and image data and observe consistent improvements from our knowledge distillation enhancements.
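Below is a minimal sketch of the two-step distillation pipeline with the cross-fitting enhancement described above: each training example receives soft labels from a teacher fit on the other folds, so the teacher's overfitting cannot leak into the student's targets. The choice of five folds, the random-forest teacher, the logistic-regression student, and the sample-weight trick for fitting on soft targets are all illustrative assumptions, not the paper's exact setup; the loss-correction enhancement is omitted.

```python
# Illustrative sketch only: cross-fit knowledge distillation with scikit-learn.
# Models, fold count, and the soft-target trick are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Cross-fitting: each example's soft label comes from a teacher that never saw it.
soft_labels = np.zeros((len(y), 2))
for train_idx, held_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    teacher = RandomForestClassifier(n_estimators=100, random_state=0)
    teacher.fit(X[train_idx], y[train_idx])
    soft_labels[held_idx] = teacher.predict_proba(X[held_idx])

# Distillation step: train the student on the teacher's class probabilities.
# Soft targets are mimicked by duplicating each example with both labels and
# weighting by the teacher probability, since this student API takes hard labels.
X_rep = np.vstack([X, X])
y_rep = np.concatenate([np.zeros(len(y)), np.ones(len(y))])
w_rep = np.concatenate([soft_labels[:, 0], soft_labels[:, 1]])
student = LogisticRegression(max_iter=1000)
student.fit(X_rep, y_rep, sample_weight=w_rep)

print("student accuracy on hard labels:", student.score(X, y))
```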
