Teach-DETR: Better Training DETR with Teachers

In this paper, we present a novel training scheme, Teach-DETR, for learning better DETR-based detectors from versatile teacher detectors. We show that the predicted boxes of teacher detectors, which can be either RCNN-based or DETR-based, are an effective medium for transferring their knowledge to train a more accurate and robust DETR model. The scheme can easily incorporate predicted boxes from multiple teacher detectors, each of which provides parallel supervision to the student DETR. Our strategy introduces no additional parameters and adds negligible computational cost to the original detector during training. During inference, Teach-DETR incurs zero additional overhead and retains the merit of requiring no non-maximum suppression. Extensive experiments show that our method yields consistent improvements for various DETR-based detectors. Specifically, it improves the state-of-the-art detector DINO with a Swin-Large backbone, 4 scales of feature maps, and a 36-epoch training schedule from 57.8% to 58.9% mean average precision on the MSCOCO 2017 validation set. Code will be available at https://github.com/LeonHLJ/Teach-DETR.
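The abstract does not spell out implementation details, but the core idea it describes, treating each teacher's predicted boxes as an independent, parallel set of supervision targets for the student's query predictions, can be sketched as follows. Everything below is an illustrative assumption: the helper names (`hungarian_l1_loss`, `teach_detr_loss`) are hypothetical, and a plain L1 matching cost stands in for DETR's full classification + box + GIoU criterion and for any per-box weighting the paper may apply.

```python
# Minimal sketch of parallel teacher-box supervision for a DETR student.
# Assumptions: boxes are [N, 4] tensors in a common format (e.g. cxcywh),
# and a simple L1 cost replaces DETR's full Hungarian matching criterion.
import torch
from scipy.optimize import linear_sum_assignment


def hungarian_l1_loss(pred_boxes, target_boxes):
    """One-to-one match predictions to targets, return mean L1 over pairs."""
    # Pairwise L1 cost between every prediction and every target box.
    cost = torch.cdist(pred_boxes, target_boxes, p=1)  # [num_preds, num_targets]
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    return (pred_boxes[row] - target_boxes[col]).abs().mean()


def teach_detr_loss(pred_boxes, gt_boxes, teacher_boxes_list):
    """Ground-truth supervision plus one independent matching per teacher."""
    loss = hungarian_l1_loss(pred_boxes, gt_boxes)
    for teacher_boxes in teacher_boxes_list:
        # Each teacher supervises the same student queries through its own
        # bipartite matching; the teachers never interact with each other.
        loss = loss + hungarian_l1_loss(pred_boxes, teacher_boxes)
    return loss


if __name__ == "__main__":
    torch.manual_seed(0)
    preds = torch.rand(100, 4, requires_grad=True)     # student query boxes
    gts = torch.rand(5, 4)                             # ground-truth boxes
    teachers = [torch.rand(20, 4), torch.rand(30, 4)]  # boxes from two teachers
    print(teach_detr_loss(preds, gts, teachers))
```

Because the teacher boxes enter only through additional training-time loss terms, the student architecture is untouched, which is consistent with the stated zero inference-time overhead.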
