Exploring Bottom-Up and Top-Down Cues With Attentive Learning for Webly Supervised Object Detection

Fully supervised object detection has achieved great success in recent years. However, training a detector for novel classes requires abundant bounding-box annotations. To reduce the human labeling effort, we propose a novel webly supervised object detection (WebSOD) method for novel classes that requires only web images, without further annotations. Our proposed method combines bottom-up and top-down cues for novel class detection. Within our approach, we introduce a bottom-up mechanism that uses a well-trained fully supervised object detector (i.e., Faster R-CNN) as an object region estimator for web images, exploiting the common objectness shared by base and novel classes. With the estimated regions on the web images, we then use top-down attention cues as guidance for region classification. Furthermore, we propose a residual feature refinement (RFR) block to tackle the domain mismatch between the web domain and the target domain. We demonstrate our proposed method on the PASCAL VOC dataset with three different novel/base splits. Without any target-domain novel-class images or annotations, our proposed webly supervised object detection model achieves promising performance on novel classes. Moreover, we conduct transfer learning experiments on the large-scale ILSVRC 2013 detection dataset and achieve state-of-the-art performance.
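As a rough illustration of how bottom-up regions and top-down attention could be combined, the following minimal sketch assigns each class-agnostic box the class whose attention map has the highest mean value inside that box. This is only an assumed scoring rule for illustration, not the method of the paper; the names `mean_attention` and `label_regions`, the box format, and the use of plain nested lists as attention maps are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed conventions (not from the paper):
# a box is (x1, y1, x2, y2) in pixel coordinates with x2 and y2 exclusive,
# and an attention map is a list of rows of per-pixel values in [0, 1].
Box = Tuple[int, int, int, int]
AttentionMap = List[List[float]]


def mean_attention(att: AttentionMap, box: Box) -> float:
    """Average top-down attention inside a box (hypothetical scoring rule)."""
    x1, y1, x2, y2 = box
    values = [att[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return sum(values) / len(values) if values else 0.0


def label_regions(
    boxes: List[Box], attention: Dict[str, AttentionMap]
) -> List[Tuple[Box, str, float]]:
    """Give each bottom-up box the class with the strongest mean attention."""
    labeled = []
    for box in boxes:
        scores = {cls: mean_attention(att, box) for cls, att in attention.items()}
        best = max(scores, key=scores.get)
        labeled.append((box, best, scores[best]))
    return labeled
```

In this sketch the boxes stand in for the output of a class-agnostic region estimator and the attention maps for per-class top-down cues; a real system would obtain both from trained networks.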
