Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection

Localization Quality Estimation (LQE) is crucial and popular in the recent advancement of dense object detectors since it can provide accurate ranking scores that benefit the Non-Maximum Suppression processing and improve detection performance. As a common practice, most existing methods predict LQE scores through vanilla convolutional features shared with object classification or bounding box regression. In this paper, we explore a completely novel and different perspective to perform LQE – based on the learned distributions of the four parameters of the bounding box. The bounding box distributions are inspired and introduced as "General Distribution" in GFLV1, which describes the uncertainty of the predicted bounding boxes well. Such a property makes the distribution statistics of a bounding box highly correlated to its real localization quality. Specifically, a bounding box distribution with a sharp peak usually corresponds to high localization quality, and vice versa. By leveraging the close correlation between distribution statistics and the real localization quality, we develop a considerably lightweight Distribution-Guided Quality Predictor (DGQP) for reliable LQE based on GFLV1, thus producing GFLV2. To our best knowledge, it is the first attempt in object detection to use a highly relevant, statistical representation to facilitate LQE. Extensive experiments demonstrate the effectiveness of our method. Notably, GFLV2 (ResNet101) achieves 46.2 AP at 14.6 FPS, surpassing the previous state-of-the-art ATSS baseline (43.6 AP at 14.6 FPS) by absolute 2.6 AP on COCO test-dev, without sacrificing the efficiency both in training and inference.

[1]  Qixiang Ye,et al.  FreeAnchor: Learning to Match Anchors for Visual Object Detection , 2019, NeurIPS.

[2]  Ying Wang,et al.  VarifocalNet: An IoU-aware Dense Object Detector , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Kai Zhao,et al.  Res2Net: A New Multi-Scale Backbone Architecture , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Jian Sun,et al.  BorderDet: Border Feature for Dense Object Detection , 2020, ECCV.

[5]  Zheng Zhang,et al.  RepPoints V2: Verification Meets Regression for Object Detection , 2020, NeurIPS.

[6]  Hee Seok Lee,et al.  Probabilistic Anchor Assignment with IoU Prediction for Object Detection , 2020, ECCV.

[7]  Jian Sun,et al.  AutoAssign: Differentiable Label Assignment for Dense Object Detection , 2020, ArXiv.

[8]  Jun Li,et al.  Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection , 2020, NeurIPS.

[9]  Fei Wang,et al.  CentripetalNet: Pursuing High-Quality Keypoint Pairs for Object Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Guanglu Song,et al.  Revisiting the Sibling Head in Object Detector , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Xinggang Wang,et al.  IoU-aware Single-stage Object Detector for Accurate Localization , 2019, Image Vis. Comput..

[12]  Kai Chen,et al.  Side-Aware Boundary Localization for More Precise Object Detection , 2019, ECCV.

[13]  Shifeng Zhang,et al.  Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Zhiqiang Shen,et al.  Soft Anchor-Point Object Detection , 2019, ECCV.

[15]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[17]  Xuelong Li,et al.  Hierarchical Shot Detector , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[18]  Kai Chen,et al.  MMDetection: Open MMLab Detection Toolbox and Benchmark , 2019, ArXiv.

[19]  Stephen Lin,et al.  RepPoints: Point Set Representation for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20]  Qi Tian,et al.  CenterNet: Keypoint Triplets for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Xingyi Zhou,et al.  Objects as Points , 2019, ArXiv.

[22]  Yuning Jiang,et al.  FoveaBox: Beyond Anchor-based Object Detector , 2019, ArXiv.

[23]  Huajun Feng,et al.  Libra R-CNN: Towards Balanced Learning for Object Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Hao Chen,et al.  FCOS: Fully Convolutional One-Stage Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[25]  Marios Savvides,et al.  Feature Selective Anchor-Free Module for Single-Shot Object Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Yongchao Gong,et al.  Mask Scoring R-CNN , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Zhaoxiang Zhang,et al.  Scale-Aware Trident Networks for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[28]  Junjie Yan,et al.  Grid R-CNN , 2018, 1811.12030.

[29]  Stephen Lin,et al.  Deformable ConvNets V2: More Deformable, Better Results , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Hei Law,et al.  CornerNet: Detecting Objects as Paired Keypoints , 2018, International Journal of Computer Vision.

[31]  Yuning Jiang,et al.  Acquisition of Localization Confidence for Accurate Object Detection , 2018, ECCV.

[32]  Ali Farhadi,et al.  YOLOv3: An Incremental Improvement , 2018, ArXiv.

[33]  Nuno Vasconcelos,et al.  Cascade R-CNN: Delving Into High Quality Object Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Lars Petersson,et al.  Improving Object Localization with Fitness NMS and Bounded IoU Loss , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[35]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[36]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Zhuowen Tu,et al.  Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[40]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[42]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[45]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[46]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.