Spatial likelihood voting with self-knowledge distillation for weakly supervised object detection

Abstract Weakly supervised object detection (WSOD), which is an effective way to train an object detection model using only image-level annotations, has attracted considerable attention from researchers. However, most of the existing methods, which are based on multiple instance learning (MIL), tend to localize instances to the discriminative parts of salient objects instead of the entire content of all objects. In this paper, we propose a WSOD framework called the Spatial Likelihood Voting with Self-knowledge Distillation Network (SLV-SD Net). In this framework, we introduce a spatial likelihood voting (SLV) module to converge region proposal localization without bounding box annotations. Specifically, in every iteration during training, all the region proposals in a given image act as voters voting for the likelihood of each category in the spatial dimensions. After dilating the alignment on the area with large likelihood values, the voting results are regularized as bounding boxes, which are then used for the final classification and localization. Based on SLV, we further propose a self-knowledge distillation (SD) module to refine the feature representations of the given image. The likelihood maps generated by the SLV module are used to supervise the feature learning of the backbone network, encouraging the network to attend to wider and more diverse areas of the image. Extensive experiments on the PASCAL VOC 2007/2012 and MS-COCO datasets demonstrate the excellent performance of SLV-SD Net. In addition, SLV-SD Net produces new state-of-the-art results on these benchmarks.

[1]  Luc Van Gool,et al.  The Pascal Visual Object Classes Challenge: A Retrospective , 2014, International Journal of Computer Vision.

[2]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[3]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Tinne Tuytelaars,et al.  Weakly supervised object detection with convex clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Qixiang Ye,et al.  Min-Entropy Latent Model for Weakly Supervised Object Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Wenyu Liu,et al.  Weakly Supervised Region Proposal Network and Object Detection , 2018, ECCV.

[7]  Yong Jae Lee,et al.  Instance-Aware, Context-Focused, and Memory-Efficient Weakly Supervised Object Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Andrea Vedaldi,et al.  Weakly Supervised Deep Detection Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Kaiqi Huang,et al.  Weakly Supervised Large Scale Object Localization with Multiple Instance Learning and Bag Splitting , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[12]  Cordelia Schmid,et al.  Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[14]  Wenyu Liu,et al.  Deep patch learning for weakly supervised object classification and discovery , 2017, Pattern Recognit..

[15]  Yang Zou,et al.  Comprehensive Attention Self-Distillation for Weakly-Supervised Object Detection , 2020, NeurIPS.

[16]  Liujuan Cao,et al.  Cyclic Guidance for Weakly Supervised Joint Detection and Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[18]  Wenyu Liu,et al.  PCL: Proposal Cluster Learning for Weakly Supervised Object Detection , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  Jian Sun,et al.  Objects365: A Large-Scale, High-Quality Dataset for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[20]  Quoc V. Le,et al.  EfficientDet: Scalable and Efficient Object Detection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[22]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Wengang Zhou,et al.  Instance Mining with Class Feature Banks for Weakly Supervised Object Detection , 2021, AAAI.

[24]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[26]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Yoshua Bengio,et al.  FitNets: Hints for Thin Deep Nets , 2014, ICLR.

[28]  Jinjun Xiong,et al.  TS2C: Tight Box Mining with Surrounding Segmentation Context for Weakly Supervised Object Detection , 2018, ECCV.

[29]  Hongyang Chao,et al.  WSOD2: Learning Bottom-Up and Top-Down Objectness Distillation for Weakly-Supervised Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Yong Dou,et al.  Towards Precise End-to-End Weakly Supervised Object Detection Network , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Nikos Komodakis,et al.  Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer , 2016, ICLR.

[32]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[33]  Qi Tian,et al.  Zigzag Learning for Weakly Supervised Object Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Rongxin Jiang,et al.  SLV: Spatial Likelihood Voting for Weakly Supervised Object Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Shiguang Shan,et al.  Weakly Supervised Object Detection With Segmentation Collaboration , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[36]  Wenyu Liu,et al.  Multiple Instance Detection Network with Online Instance Classifier Refinement , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Larry S. Davis,et al.  C-WSL: Count-guided Weakly Supervised Localization , 2017, ECCV.

[38]  Kaisheng Ma,et al.  Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[39]  C. V. Jawahar,et al.  Dissimilarity Coefficient Based Weakly Supervised Object Detection , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Ivan Laptev,et al.  ContextLocNet: Context-Aware Deep Network Models for Weakly Supervised Localization , 2016, ECCV.

[41]  Chen Change Loy,et al.  Learning Lightweight Lane Detection CNNs by Self Attention Distillation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[42]  Rich Caruana,et al.  Model compression , 2006, KDD '06.

[43]  Dongrui Fan,et al.  Utilizing the Instability in Weakly Supervised Object Detection , 2019, CVPR Workshops.

[44]  Bernard Ghanem,et al.  W2F: A Weakly-Supervised to Fully-Supervised Framework for Object Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[46]  Chang Liu,et al.  C-MIL: Continuation Multiple Instance Learning for Weakly Supervised Object Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Geoffrey E. Hinton,et al.  Distilling the Knowledge in a Neural Network , 2015, ArXiv.

[48]  Luc Van Gool,et al.  Weakly Supervised Cascaded Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[50]  Alan L. Yuille,et al.  Snapshot Distillation: Teacher-Student Optimization in One Generation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).