Self-distillation with Batch Knowledge Ensembling Improves ImageNet Classification

Recent studies of knowledge distillation [14, 28, 40, 47] have found that ensembling the “dark knowledge” from multiple teachers or students produces better soft targets for training, but at the cost of significantly more computation and/or parameters. In this work, we present BAtch Knowledge Ensembling (BAKE), which produces refined soft targets for anchor images by propagating and ensembling the knowledge of the other samples in the same mini-batch. Specifically, for each sample of interest, knowledge is propagated from the other samples with weights given by their inter-sample affinities, which are estimated on the fly by the current network. The propagated knowledge is then ensembled to form a better soft target for distillation. In this way, BAKE achieves online knowledge ensembling across multiple samples with only a single network, and it requires minimal computational and memory overhead compared with existing knowledge ensembling methods. Extensive experiments demonstrate that the lightweight yet effective BAKE consistently boosts the classification performance of various architectures on multiple datasets, e.g., a significant +0.7% gain for Swin-T on ImageNet with only +1.5% computational overhead and no additional parameters. BAKE not only improves the vanilla baselines but also surpasses single-network state-of-the-art methods [6, 52, 53, 56, 58] on all benchmarks.
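The core idea can be illustrated with a short PyTorch-style sketch. The snippet below is a minimal, simplified rendering of batch knowledge ensembling, assuming L2-normalized embeddings and a single weight that balances the anchor's own prediction against the knowledge propagated from the rest of the batch; the helper names (`bake_soft_targets`, `bake_loss`) and hyper-parameters (`temperature`, `omega`) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def bake_soft_targets(features, logits, temperature=4.0, omega=0.5):
    """Build refined soft targets by ensembling knowledge within a mini-batch.

    features: (B, D) L2-normalized embeddings from the current network.
    logits:   (B, C) classification logits for the same mini-batch.
    Both are detached so the targets act as fixed distillation targets.
    """
    features = features.detach()
    logits = logits.detach()

    # Inter-sample affinities estimated on the fly; mask self-affinities so
    # a sample does not simply reinforce its own prediction.
    affinities = features @ features.t()                        # (B, B)
    eye = torch.eye(affinities.size(0), dtype=torch.bool,
                    device=affinities.device)
    weights = F.softmax(affinities.masked_fill(eye, float('-inf')), dim=1)

    # Temperature-softened predictions of every sample in the batch.
    probs = F.softmax(logits / temperature, dim=1)              # (B, C)

    # Propagate knowledge from the other samples and ensemble it with the
    # anchor's own prediction; omega balances the two terms.
    return omega * probs + (1.0 - omega) * (weights @ probs)


def bake_loss(features, logits, temperature=4.0, omega=0.5):
    """Distillation loss between current predictions and the BAKE targets."""
    targets = bake_soft_targets(features, logits, temperature, omega)
    log_probs = F.log_softmax(logits / temperature, dim=1)
    return F.kl_div(log_probs, targets, reduction='batchmean') * temperature ** 2
```

In a training loop, such a distillation term would typically be added to the standard cross-entropy loss on the hard labels, so the soft targets are refined and consumed online within each mini-batch.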

[1] Antonio Torralba, et al. Recognizing indoor scenes, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[2] Quoc V. Le, et al. Self-Training With Noisy Student Improves ImageNet Classification, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Chen Sun, et al. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[4] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5] Cheng-Lin Liu, et al. Data-Distortion Guided Self-Distillation for Deep Neural Networks, 2019, AAAI.

[6] Li Fei-Fei, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.

[7] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Kaiming He, et al. Momentum Contrast for Unsupervised Visual Representation Learning, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9] Huchuan Lu, et al. Deep Mutual Learning, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10] Jianbo Shi, et al. Convolutional Random Walk Networks for Semantic Image Segmentation, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Geoffrey E. Hinton, et al. Distilling the Knowledge in a Neural Network, 2015, ArXiv.

[13] Zachary Chase Lipton, et al. Born Again Neural Networks, 2018, ICML.

[14] Zhiqiang Shen, et al. MEAL: Multi-Model Ensemble via Adversarial Learning, 2018, AAAI.

[15] Xu Lan, et al. Knowledge Distillation by On-the-Fly Native Ensemble, 2018, NeurIPS.

[16] Harshad Rai, et al. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, 2018.

[17] Dumitru Erhan, et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Harri Valpola, et al. Weight-averaged consistency targets improve semi-supervised deep learning results, 2017, ArXiv.

[19] Ali Farhadi, et al. Label Refinery: Improving ImageNet Classification through Label Progression, 2018, ArXiv.

[20] Haibo Wang, et al. Self-supervising Fine-grained Region Similarities for Large-scale Image Localization, 2020, ECCV.

[21] Kaiming He, et al. Designing Network Design Spaces, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Wonjun Hwang, et al. Densely Guided Knowledge Distillation using Multiple Teacher Assistants, 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[23] Kaisheng Ma, et al. Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24] Jinwoo Shin, et al. Regularizing Class-Wise Predictions via Self-Knowledge Distillation, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Xiaogang Wang, et al. Pyramid Scene Parsing Network, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Sangheum Hwang, et al. Self-Knowledge Distillation with Progressive Refinement of Targets, 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[27] Jian Sun, et al. Identity Mappings in Deep Residual Networks, 2016, ECCV.

[28] Xiaolin Hu, et al. Online Knowledge Distillation via Collaborative Learning, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Thomas G. Dietterich, et al. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations, 2018, ICLR.

[30] Yan Lu, et al. Relational Knowledge Distillation, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Kaiming He, et al. Feature Pyramid Networks for Object Detection, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Seong Joon Oh, et al. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[33] Jiashi Feng, et al. Revisit Knowledge Distillation: a Teacher-free Framework, 2019, ArXiv.

[34] Yonglong Tian, et al. Contrastive Representation Distillation, 2019, ICLR.

[35] Hongsheng Li, et al. Self-paced Contrastive Learning with Hybrid Memory for Domain Adaptive Object Re-ID, 2020, NeurIPS.

[36] Quoc V. Le, et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, 2019, ICML.

[37] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[38] Zhuowen Tu, et al. Aggregated Residual Transformations for Deep Neural Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39] Mark Sandler, et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40] Alexei A. Efros, et al. Image-to-Image Translation with Conditional Adversarial Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41] Kaiming He, et al. Exploring the Limits of Weakly Supervised Pretraining, 2018, ECCV.

[42] Ross B. Girshick, et al. Mask R-CNN, 2017, 1703.06870.

[43] Xiaohua Zhai, et al. Are we done with ImageNet?, 2020, ArXiv.

[44] Bernhard Schölkopf, et al. Learning with Local and Global Consistency, 2003, NIPS.

[45] Kilian Q. Weinberger, et al. Densely Connected Convolutional Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Stephen Lin, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[47] Matthias Hein, et al. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks, 2020, ICML.

[48] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.

[49] Geoffrey E. Hinton, et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020, ICML.

[50] Fei-Fei Li, et al. Novel Dataset for Fine-Grained Image Categorization: Stanford Dogs, 2012.

[51] Seong Joon Oh, et al. Re-labeling ImageNet: from Single to Multi-Labels, from Global to Localized Labels, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52] Greg Mori, et al. Similarity-Preserving Knowledge Distillation, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[53] Hongsheng Li, et al. DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[54] Benjamin Recht, et al. Evaluating Machine Accuracy on ImageNet, 2020, ICML.

[55] Xiaogang Wang, et al. FD-GAN: Pose-guided Feature Distilling GAN for Robust Person Re-identification, 2018, NeurIPS.

[56] Michal Valko, et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, 2020, NeurIPS.

[57] Jonathon Shlens, et al. Explaining and Harnessing Adversarial Examples, 2014, ICLR.