From rigid templates to grammars: object detection with structured models

We develop models for localizing instances of a generic object category, such as cars or people, in images. We define these models using a grammar formalism. In this formalism compositional rules are used to encode models that can range in complexity from simple rigid templates to rich deformable part models with variable structure. A central contribution of this dissertation is an exploration along this axis, wherein we gradually enrich our object category representations. We demonstrate that these richer models lead to improved object detection performance on challenging datasets such as the PASCAL VOC Challenges. While building richer models, we would like to make use of existing training data and annotations. These annotations typically specify labels, such as object bounding boxes, that are "weak'' compared to the derivation trees produced by detection with a grammar model. We propose a new discriminative training framework that directly supports learning models from weakly-labeled examples. We show how to apply this framework to the problem of learning the parameters of a grammar model. This approach results in a top-performing method for detecting people in images. In order to achieve widespread use in research and applications, an object detection system must not only be accurate, but also fast. Along the line of efficient computation, we develop a technique for "compiling" one of our object models into a much faster detector that implements a cascade architecture. We show how to select the cascade thresholds in a way that is both safe and effective. We demonstrate that the cascaded detector produces detections 15x faster than the non-cascade approach with no loss in precision or recall.

[1]  William T. Freeman,et al.  Orientation Histograms for Hand Gesture Recognition , 1995 .

[2]  Kazuo Kyuma,et al.  Computer vision for computer games , 1996, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition.

[3]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[4]  Dariu Gavrila,et al.  The Visual Analysis of Human Movement: A Survey , 1999, Comput. Vis. Image Underst..

[5]  Dariu Gavrila,et al.  Real-time object detection for "smart" vehicles , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[6]  Daniel P. Huttenlocher,et al.  Efficient matching of pictorial structures , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[7]  Serge J. Belongie,et al.  Matching shapes , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[8]  Tomaso A. Poggio,et al.  Example-Based Object Detection in Images by Components , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[9]  Cordelia Schmid,et al.  Learning to Parse Pictures of People , 2002, ECCV.

[10]  Luc Van Gool,et al.  Efficient pedestrian detection : a test case for SVM based categorization , 2002 .

[11]  Paul A. Viola,et al.  Detecting Pedestrians Using Patterns of Motion and Appearance , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[12]  David A. Forsyth,et al.  Probabilistic Methods for Finding People , 2001, International Journal of Computer Vision.

[13]  Takeo Kanade,et al.  Object Detection Using the Statistics of Parts , 2004, International Journal of Computer Vision.

[14]  D.M. Gavrila,et al.  Vision-based pedestrian detection: the PROTECTOR system , 2004, IEEE Intelligent Vehicles Symposium, 2004.

[15]  Yan Ke,et al.  PCA-SIFT: a more distinctive representation for local image descriptors , 2004, CVPR 2004.

[16]  Cordelia Schmid,et al.  Scale & Affine Invariant Interest Point Detectors , 2004, International Journal of Computer Vision.

[17]  Tomaso A. Poggio,et al.  A Trainable System for Object Detection , 2000, International Journal of Computer Vision.

[18]  Cordelia Schmid,et al.  Human Detection Based on a Probabilistic Assembly of Robust Part Detectors , 2004, ECCV.

[19]  E. L. Schwartz,et al.  Spatial mapping in the primate sensory projection: Analytic structure and relevance to perception , 1977, Biological Cybernetics.

[20]  Cordelia Schmid,et al.  A Performance Evaluation of Local Descriptors , 2005, IEEE Trans. Pattern Anal. Mach. Intell..

[21]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .