R-FCN-3000 at 30fps: Decoupling Detection and Classification

We propose a modular approach towards large-scale real-time object detection by decoupling objectness detection and classification. We exploit the fact that many object classes are visually similar and share parts. Thus, a universal objectness detector can be learned for class-agnostic object detection followed by fine-grained classification using a (non)linear classifier. Our approach is a modification of the R-FCN architecture to learn shared filters for performing localization across different object classes. We trained a detector for 3000 object classes, called R-FCN-3000, that obtains an mAP of 34.9% on the ImageNet detection dataset. It outperforms YOLO-9000 by 18% while processing 30 images per second. We also show that the objectness learned by R-FCN-3000 generalizes to novel classes and the performance increases with the number of training object classes - supporting the hypothesis that it is possible to learn a universal objectness detector. Code will be made available.

[1]  Sergio Guadarrama,et al.  Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Samy Bengio,et al.  Large-Scale Object Classification Using Label Relation Graphs , 2014, ECCV.

[3]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Joshua B. Tenenbaum,et al.  Learning to share visual appearance for multiclass object detection , 2011, CVPR 2011.

[5]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[6]  Abhinav Gupta,et al.  Training Region-Based Object Detectors with Online Hard Example Mining , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Yuxing Tang,et al.  Large Scale Semi-Supervised Object Detection Using Visual and Semantic Knowledge Transfer , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Trevor Darrell,et al.  LSDA: Large Scale Detection through Adaptation , 2014, NIPS.

[9]  Yi Li,et al.  Deformable Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[10]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[11]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Calvin C. Zhao Critical Review : Contour Detection and Hierarchical Image Segmentation , 2015 .

[13]  Antonio Torralba,et al.  Sharing features: efficient boosting procedures for multiclass object detection , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[14]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[15]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Thomas Deselaers,et al.  Measuring the Objectness of Image Windows , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[18]  Yi Li,et al.  Fully Convolutional Instance-Aware Semantic Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[20]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[21]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[22]  Christopher D. Manning,et al.  An extended model of natural logic , 2009, IWCS.

[23]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Shih-Fu Chang,et al.  PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[25]  Shuicheng Yan,et al.  Dual Path Networks , 2017, NIPS.

[26]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[27]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Luc Van Gool,et al.  Weakly Supervised Cascaded Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[30]  Ross B. Girshick,et al.  Focal Loss for Dense Object Detection , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Yi Li,et al.  R-FCN: Object Detection via Region-based Fully Convolutional Networks , 2016, NIPS.

[32]  Mark Everingham,et al.  Shared parts for deformable part-based models , 2011, CVPR 2011.

[33]  Vittorio Ferrari,et al.  Revisiting Knowledge Transfer for Training Object Class Detectors , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Andrew Zisserman,et al.  Detect to Track and Track to Detect , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[35]  Andrea Vedaldi,et al.  I Have Seen Enough: Transferring Parts Across Categories , 2016, BMVC.

[36]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[38]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[40]  Cristian Sminchisescu,et al.  Constrained parametric min-cuts for automatic object segmentation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[41]  Trevor Darrell,et al.  Large Scale Visual Recognition through Adaptation using Joint Representation and Multiple Instance Learning , 2016, J. Mach. Learn. Res..