Teacher's pet: understanding and mitigating biases in distillation

Knowledge distillation is widely used as a means of improving the performance of a relatively simple “student” model using the predictions from a complex “teacher” model. Several works have shown that distillation significantly boosts the student’s overall performance; however, are these gains uniform across all data subgroups? In this paper, we show that distillation can harm performance on certain subgroups, e.g., classes with few associated samples. We trace this behaviour to errors in the teacher’s predictive distribution being transferred to, and amplified by, the student model. To mitigate this problem, we present techniques which soften the teacher’s influence for subgroups where it is less reliable. Experiments on several image classification benchmarks show that these modifications of distillation maintain the boost in overall accuracy, while additionally ensuring improvements in subgroup performance.
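
The abstract does not spell out the exact form of the mitigation, but the idea of softening the teacher's influence for unreliable subgroups can be illustrated with a small sketch. The snippet below is an assumption-laden illustration, not the paper's method: it mixes the usual one-hot cross-entropy with a temperature-scaled distillation term, and down-weights the teacher term via a per-class weight (`subgroup_weight`, a hypothetical name) that would be set low for, e.g., rare classes where the teacher is presumed less reliable.

```python
# Minimal sketch (assumed formulation, not the paper's exact method): a distillation
# loss whose teacher term is down-weighted for designated subgroups (here, classes)
# where the teacher is assumed to be less reliable. Names such as subgroup_weight,
# temperature and alpha are illustrative choices.

import torch
import torch.nn.functional as F


def subgroup_aware_distillation_loss(
    student_logits: torch.Tensor,   # [batch, num_classes]
    teacher_logits: torch.Tensor,   # [batch, num_classes]
    labels: torch.Tensor,           # [batch] integer class labels
    subgroup_weight: torch.Tensor,  # [num_classes] in [0, 1]; lower = trust teacher less
    temperature: float = 2.0,
    alpha: float = 0.5,             # base mixing weight on the teacher term
) -> torch.Tensor:
    # Standard cross-entropy against the ground-truth (one-hot) labels.
    ce_loss = F.cross_entropy(student_logits, labels, reduction="none")

    # KL divergence between temperature-smoothed teacher and student distributions.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_loss = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)
    kd_loss = kd_loss * (temperature ** 2)  # usual temperature scaling of the KD term

    # Per-example teacher weight, looked up from the example's subgroup (its class).
    w = alpha * subgroup_weight[labels]

    # Soften the teacher term where it is less reliable; the label term dominates there.
    loss = (1.0 - w) * ce_loss + w * kd_loss
    return loss.mean()
```

One plausible way to set the weights, under the same assumption, is from per-class training counts (e.g., `subgroup_weight = (counts / counts.max()).clamp(min=0.1)`), so the teacher is trusted most on well-represented classes and least on tail classes; the abstract does not specify this choice.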
