论文信息 - UP-DETR: Unsupervised Pre-training for Object Detection with Transformers

UP-DETR: Unsupervised Pre-training for Object Detection with Transformers

Object detection with transformers (DETR) reaches competitive performance with Faster R-CNN via a transformer encoder-decoder architecture. Inspired by the great success of pre-training transformers in natural language processing, we propose a pretext task named random query patch detection to unsupervisedly pre-train DETR (UP-DETR) for object detection. Specifically, we randomly crop patches from the given image and then feed them as queries to the decoder. The model is pre-trained to detect these query patches from the original image. During the pre-training, we address two critical issues: multi-task learning and multi-query localization. (1) To trade-off multi-task learning of classification and localization in the pretext task, we freeze the CNN backbone and propose a patch feature reconstruction branch which is jointly optimized with patch detection. (2) To perform multi-query localization, we introduce UP-DETR from single-query patch and extend it to multi-query patches with object query shuffle and attention mask. In our experiments, UP-DETR significantly boosts the performance of DETR with faster convergence and higher precision on PASCAL VOC and COCO datasets. The code will be available soon.

[1] Kaiming He,et al. Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Wei Wu,et al. High Performance Visual Tracking with Siamese Region Proposal Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3] Li Fei-Fei,et al. ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[4] Stella X. Yu,et al. Unsupervised Feature Learning via Non-parametric Instance Discrimination , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Nitish Srivastava,et al. Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[7] Shih-Fu Chang,et al. Unsupervised Embedding Learning via Invariant and Spreading Instance Feature , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Wei Liu,et al. SSD: Single Shot MultiBox Detector , 2015, ECCV.

[9] Yi Li,et al. Fully Convolutional Instance-Aware Semantic Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10] Xiaodong Liu,et al. Unified Language Model Pre-training for Natural Language Understanding and Generation , 2019, NeurIPS.

[11] Alec Radford,et al. Improving Language Understanding by Generative Pre-Training , 2018 .

[12] Carsten Rother,et al. Panoptic Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Xingyi Zhou,et al. Objects as Points , 2019, ArXiv.

[14] Matthijs Douze,et al. Deep Clustering for Unsupervised Learning of Visual Features , 2018, ECCV.

[15] Frank Hutter,et al. Decoupled Weight Decay Regularization , 2017, ICLR.

[16] Yue Zhang,et al. Comparison Network for One-Shot Conditional Object Detection , 2019, ArXiv.

[17] Nicolas Usunier,et al. End-to-End Object Detection with Transformers , 2020, ECCV.

[18] Andrea Vedaldi,et al. Self-labelling via simultaneous clustering and representation learning , 2020, ICLR.

[19] Kaiming He,et al. Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[21] Hei Law,et al. CornerNet: Detecting Objects as Paired Keypoints , 2018, ECCV.

[22] Bin Liu,et al. Parametric Instance Classification for Unsupervised Visual Feature Learning , 2020, NeurIPS.

[23] Xiang Li,et al. One-Shot Object Detection without Fine-Tuning , 2020, ArXiv.

[24] Hwann-Tzong Chen,et al. One-Shot Object Detection with Co-Attention and Co-Excitation , 2019, NeurIPS.

[25] Julien Mairal,et al. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments , 2020, NeurIPS.

[26] Hao Chen,et al. FCOS: Fully Convolutional One-Stage Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27] Michal Valko,et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning , 2020, NeurIPS.

[28] Ilya Sutskever,et al. Language Models are Unsupervised Multitask Learners , 2019 .

[29] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30] Yiming Yang,et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding , 2019, NeurIPS.

[31] Yann LeCun,et al. Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[32] Kaiming He,et al. Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Nuno Vasconcelos,et al. Cascade R-CNN: Delving Into High Quality Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34] Kaiming He,et al. Improved Baselines with Momentum Contrastive Learning , 2020, ArXiv.

[35] Haichao Zhang,et al. Towards Adversarially Robust Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[36] Junjie Yan,et al. Grid R-CNN , 2018, 1811.12030.

[37] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[38] Luca Bertinetto,et al. Fully-Convolutional Siamese Networks for Object Tracking , 2016, ECCV Workshops.

[39] Andrew Y. Ng,et al. End-to-End People Detection in Crowded Scenes , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Ross B. Girshick,et al. Mask R-CNN , 2017, 1703.06870.

[41] Guanglu Song,et al. Revisiting the Sibling Head in Object Detector , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Shifeng Zhang,et al. Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[44] Kaiming He,et al. Rethinking ImageNet Pre-Training , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[45] Geoffrey E. Hinton,et al. A Simple Framework for Contrastive Learning of Visual Representations , 2020, ICML.

[46] Ross B. Girshick,et al. Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[47] Min Bai,et al. UPSNet: A Unified Panoptic Segmentation Network , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48] Lu Yuan,et al. Rethinking Classification and Localization for Object Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49] Luc Van Gool,et al. The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.