Knowledge Distillation as Semiparametric Inference

A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model. Surprisingly, this two-step knowledge distillation process often leads to higher accuracy than training the student directly on labeled data. To explain and enhance this phenomenon, we cast knowledge distillation as a semiparametric inference problem with the optimal student model as the target, the unknown Bayes class probabilities as nuisance, and the teacher probabilities as a plug-in nuisance estimate. By adapting modern semiparametric tools, we derive new guarantees for the prediction error of standard distillation and develop two enhancements—cross-fitting and loss correction—to mitigate the impact of teacher overfitting and underfitting on student performance. We validate our findings empirically on both tabular and image data and observe consistent improvements from our knowledge distillation enhancements.
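Below is a minimal sketch of the two-step distillation pipeline with the cross-fitting enhancement described above: each training example receives soft labels from a teacher fit on the other folds, so the teacher's overfitting cannot leak into the student's targets. The choice of five folds, the random-forest teacher, the logistic-regression student, and the sample-weight trick for fitting on soft targets are all illustrative assumptions, not the paper's exact setup; the loss-correction enhancement is omitted.

```python
# Illustrative sketch only: cross-fit knowledge distillation with scikit-learn.
# Models, fold count, and the soft-target trick are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Cross-fitting: each example's soft label comes from a teacher that never saw it.
soft_labels = np.zeros((len(y), 2))
for train_idx, held_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    teacher = RandomForestClassifier(n_estimators=100, random_state=0)
    teacher.fit(X[train_idx], y[train_idx])
    soft_labels[held_idx] = teacher.predict_proba(X[held_idx])

# Distillation step: train the student on the teacher's class probabilities.
# Soft targets are mimicked by duplicating each example with both labels and
# weighting by the teacher probability, since this student API takes hard labels.
X_rep = np.vstack([X, X])
y_rep = np.concatenate([np.zeros(len(y)), np.ones(len(y))])
w_rep = np.concatenate([soft_labels[:, 0], soft_labels[:, 1]])
student = LogisticRegression(max_iter=1000)
student.fit(X_rep, y_rep, sample_weight=w_rep)

print("student accuracy on hard labels:", student.score(X, y))
```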
