Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.

[1]  Geoffrey E. Hinton,et al.  Learning internal representations by error propagation , 1986 .

[2]  Lawrence D. Jackel,et al.  Backpropagation Applied to Handwritten Zip Code Recognition , 1989, Neural Computation.

[3]  R. Vaillant,et al.  An original approach for the localization of objects in images , 1993 .

[4]  R. Vaillant,et al.  Original approach for the localisation of objects in images , 1994 .

[5]  Tomaso A. Poggio,et al.  Example-Based Learning for View-Based Human Face Detection , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[6]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[7]  Takeo Kanade,et al.  Neural Network-Based Face Detection , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[8]  Simon Haykin,et al.  GradientBased Learning Applied to Document Recognition , 2001 .

[9]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[10]  Kunihiko Fukushima,et al.  Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position , 1980, Biological Cybernetics.

[11]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[12]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[13]  Joseph J. Lim,et al.  Recognition using regions , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[15]  Cordelia Schmid,et al.  Evaluation of GIST descriptors for web-scale image search , 2009, CIVR '09.

[16]  Jitendra Malik,et al.  Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[17]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[18]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[19]  Andrew Zisserman,et al.  Multiple kernels for object detection , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[20]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Derek Hoiem,et al.  Category Independent Object Proposals , 2010, ECCV.

[22]  David A. McAllester,et al.  Object Detection with Grammar Models , 2011, NIPS.

[23]  Subhransu Maji,et al.  Semantic contours from inverse detectors , 2011, 2011 International Conference on Computer Vision.

[24]  Tieniu Tan,et al.  Boosted local structured HOG-LBP for object localization , 2011, CVPR 2011.

[25]  Graham W. Taylor,et al.  Adaptive deconvolutional networks for mid and high level feature learning , 2011, 2011 International Conference on Computer Vision.

[26]  Jitendra Malik,et al.  Semantic segmentation using regions and parts , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Thomas S. Huang,et al.  A data driven method for feature transformation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Cristian Sminchisescu,et al.  Semantic Segmentation with Second-Order Pooling , 2012, ECCV.

[29]  Thomas Deselaers,et al.  Measuring the Objectness of Image Windows , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Charless C. Fowlkes,et al.  Do We Need More Training Data or Better Models for Object Detection? , 2012, BMVC.

[31]  Derek Hoiem,et al.  Diagnosing Error in Object Detectors , 2012, ECCV.

[32]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[33]  Cristian Sminchisescu,et al.  CPMC: Automatic Object Segmentation Using Constrained Parametric Min-Cuts , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Hao Su,et al.  Crowdsourcing Annotations for Visual Object Detection , 2012, HCOMP@AAAI.

[35]  Sanja Fidler,et al.  Bottom-Up Segmentation for Top-Down Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Deva Ramanan,et al.  Histograms of Sparse Codes for Object Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[40]  Luca Maria Gambardella,et al.  Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks , 2013, MICCAI.

[41]  Joseph J. Lim,et al.  Sketch Tokens: A Learned Mid-level Representation for Contour and Object Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Dumitru Erhan,et al.  Deep Neural Networks for Object Detection , 2013, NIPS.

[43]  Jonathon Shlens,et al.  Fast, Accurate Detection of 100,000 Object Classes on a Single Machine , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Antonio Torralba,et al.  HOGgles: Visualizing Object Detection Features , 2013, 2013 IEEE International Conference on Computer Vision.

[45]  Yann LeCun,et al.  Pedestrian Detection with Unsupervised Multi-stage Feature Learning , 2012, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Ming Yang,et al.  Regionlets for Generic Object Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[47]  R. Fergus,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[48]  Jonathan T. Barron,et al.  Multiscale Combinatorial Grouping , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Michael S. Bernstein,et al.  Scalable multi-label annotation , 2014, CHI.

[50]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[51]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.