Regionlets for Generic Object Detection

Generic object detection is confronted by dealing with different degrees of variations in distinct object classes with tractable computations, which demands for descriptive and flexible object representations that are also efficient to evaluate for many locations. In view of this, we propose to model an object class by a cascaded boosting classifier which integrates various types of features from competing local regions, named as region lets. A region let is a base feature extraction region defined proportionally to a detection window at an arbitrary resolution (i.e. size and aspect ratio). These region lets are organized in small groups with stable relative positions to delineate fine grained spatial layouts inside objects. Their features are aggregated to a one-dimensional feature within one group so as to tolerate deformations. Then we evaluate the object bounding box proposal in selective search from segmentation cues, limiting the evaluation locations to thousands. Our approach significantly outperforms the state-of-the-art on popular multi-class detection benchmark datasets with a single method, without any contexts. It achieves the detection mean average precision of 41.7% on the PASCAL VOC 2007 dataset and 39.7% on the VOC 2010 for 20 object categories. It achieves 14.7% mean average precision on the Image Net dataset for 200 object categories, outperforming the latest deformable part-based model (DPM) by 4.7%.

[1]  Fatih Murat Porikli,et al.  Human Detection via Classification on Riemannian Manifolds , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Thomas Deselaers,et al.  What is an object? , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[4]  Matthew B. Blaschko,et al.  Learning a category independent object detection cascade , 2011, 2011 International Conference on Computer Vision.

[5]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.

[6]  Tsuhan Chen,et al.  Extracting adaptive contextual cues from unlabeled regions , 2011, 2011 International Conference on Computer Vision.

[7]  Iasonas Kokkinos,et al.  Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound , 2011, NIPS.

[8]  Jonathan Warrell,et al.  Proposal generation for object detection using cascaded ranking SVMs , 2011, CVPR 2011.

[9]  Cordelia Schmid,et al.  Combining efficient object localization and image classification , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[10]  Cristian Sminchisescu,et al.  Constrained parametric min-cuts for automatic object segmentation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[11]  Cordelia Schmid,et al.  Segmentation Driven Object Detection with Fisher Vectors , 2013, 2013 IEEE International Conference on Computer Vision.

[12]  David A. McAllester,et al.  Cascade object detection with deformable part models , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[13]  Jordi Gonzàlez,et al.  A coarse-to-fine approach for fast deformable object detection , 2011, CVPR 2011.

[14]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[15]  Christoph H. Lampert An efficient divide-and-conquer cascade for nonlinear object detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[16]  Koen E. A. van de Sande,et al.  Segmentation as selective search for object recognition , 2011, 2011 International Conference on Computer Vision.

[17]  Matti Pietikäinen,et al.  Face Recognition with Local Binary Patterns , 2004, ECCV.

[18]  David A. McAllester,et al.  A discriminatively trained, multiscale, deformable part model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Yoram Singer,et al.  Improved Boosting Algorithms Using Confidence-rated Predictions , 1998, COLT' 98.

[20]  Qiang Chen,et al.  Contextualizing Object Detection and Classification , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Sanja Fidler,et al.  Bottom-Up Segmentation for Top-Down Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, International Journal of Computer Vision.

[23]  Piotr Dollár,et al.  Crosstalk Cascades for Frame-Rate Pedestrian Detection , 2012, ECCV.

[24]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[25]  Jing Xiao,et al.  Contextual boost for pedestrian detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Marc'Aurelio Ranzato,et al.  Large Scale Distributed Deep Networks , 2012, NIPS.

[27]  Tieniu Tan,et al.  Boosted local structured HOG-LBP for object localization , 2011, CVPR 2011.

[28]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[29]  Luc Van Gool,et al.  Pedestrian detection at 100 frames per second , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Fatih Murat Porikli,et al.  Integral histogram: a fast way to extract histograms in Cartesian spaces , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[31]  Pietro Perona,et al.  Pedestrian detection: A benchmark , 2009, CVPR.

[32]  Subhransu Maji,et al.  Detecting People Using Mutually Consistent Poselet Activations , 2010, ECCV.

[33]  Wenyu Liu,et al.  Feature context for image classification and object detection , 2011, CVPR 2011.

[34]  Shihong Lao,et al.  Boosting nested cascade detector for multi-view face detection , 2004, ICPR 2004.

[35]  Stan Sclaroff,et al.  Contextual Object Detection Using Set-Based Classification , 2012, ECCV.

[36]  Philip H. S. Torr,et al.  BING: Binarized normed gradients for objectness estimation at 300fps , 2014, Computational Visual Media.

[37]  Cristian Sminchisescu,et al.  Object recognition as ranking holistic figure-ground hypotheses , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[38]  Thomas Deselaers,et al.  Measuring the Objectness of Image Windows , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Shuicheng Yan,et al.  An HOG-LBP human detector with partial occlusion handling , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[40]  William T. Freeman,et al.  Latent hierarchical structural learning for object detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[41]  Dumitru Erhan,et al.  Scalable Object Detection Using Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  Antonio Criminisi,et al.  TextonBoost: Joint Appearance, Shape and Context Modeling for Multi-class Object Recognition and Segmentation , 2006, ECCV.

[43]  Christoph H. Lampert,et al.  Beyond sliding windows: Object localization by efficient subwindow search , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[45]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.