Max Margin Learning of Hierarchical Configural Deformable Templates (HCDTs) for Efficient Object Parsing and Pose Estimation

In this paper we formulate a hierarchical configurable deformable template (HCDT) to model articulated visual objects—such as horses and baseball players—for tasks such as parsing, segmentation, and pose estimation. HCDTs represent an object by an AND/OR graph where the OR nodes act as switches which enables the graph topology to vary adaptively. This hierarchical representation is compositional and the node variables represent positions and properties of subparts of the object. The graph and the node variables are required to obey the summarization principle which enables an efficient compositional inference algorithm to rapidly estimate the state of the HCDT. We specify the structure of the AND/OR graph of the HCDT by hand and learn the model parameters discriminatively by extending Max-Margin learning to AND/OR graphs. We illustrate the three main aspects of HCDTs—representation, inference, and learning—on the tasks of segmenting, parsing, and pose (configuration) estimation for horses and humans. We demonstrate that the inference algorithm is fast and that max-margin learning is effective. We show that HCDTs gives state of the art results for segmentation and pose estimation when compared to other methods on benchmarked datasets.

[1]  Jianbo Shi,et al.  Bottom-up Recognition and Parsing of the Human Body , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Long Zhu,et al.  Rapid Inference on a Novel AND/OR graph for Object Detection, Segmentation and Parsing , 2007, NIPS.

[3]  Michael J. Black,et al.  Measure Locally, Reason Globally: Occlusion-sensitive Articulated Pose Estimation , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[4]  Jitendra Malik,et al.  Shape Guided Object Segmentation , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[5]  Greg Mori,et al.  Guiding model search using segmentation , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[6]  Stuart Geman,et al.  Context and Hierarchy in a Probabilistic Image Model , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[7]  James M. Coughlan,et al.  Finding Deformable Shapes Using Loopy Belief Propagation , 2002, ECCV.

[8]  Jianbo Shi,et al.  Recognizing objects by piecing together the Segmentation Puzzle , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  B. Schiele,et al.  Combined Object Categorization and Segmentation With an Implicit Shape Model , 2004 .

[10]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[11]  Thomas Hofmann,et al.  Hidden Markov Support Vector Machines , 2003, ICML.

[12]  Long Zhu,et al.  Unsupervised Structure Learning: Hierarchical Recursive Composition, Suspicious Coincidence and Competitive Exclusion , 2008, ECCV.

[13]  Vladimir Kolmogorov,et al.  "GrabCut": interactive foreground extraction using iterated graph cuts , 2004, ACM Trans. Graph..

[14]  Jitendra Malik,et al.  Recovering human body configurations using pairwise constraints between parts , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[15]  Paul A. Viola,et al.  Multiple Instance Boosting for Object Detection , 2005, NIPS.

[16]  M. Lee,et al.  Proposal maps driven MCMC for estimating human body pose in static images , 2004, CVPR 2004.

[17]  Tomaso A. Poggio,et al.  Pedestrian detection using wavelet templates , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[18]  Nebojsa Jojic,et al.  LOCUS: learning object classes with unsupervised segmentation , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[19]  Andrew Zisserman,et al.  OBJ CUT , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[20]  Daniel Snow,et al.  Efficient Deformable Template Detection and Localization without User Initialization , 2000, Comput. Vis. Image Underst..

[21]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[22]  Rina Dechter,et al.  AND/OR search spaces for graphical models , 2007, Artif. Intell..

[23]  Jitendra Malik,et al.  Cue Integration for Figure/Ground Labeling , 2005, NIPS.

[24]  David J. Spiegelhalter,et al.  Local computations with probabilities on graphical structures and their application to expert systems , 1990 .

[25]  Federico Girosi,et al.  Training support vector machines: an application to face detection , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[26]  Hong Chen,et al.  Composite Templates for Cloth Modeling and Sketching , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[27]  Long Zhu,et al.  Unsupervised Learning of Probabilistic Grammar-Markov Models for Object Categories , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Koby Crammer,et al.  On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..

[29]  Long Zhu,et al.  A Hierarchical Compositional System for Rapid Object Detection , 2005, NIPS.

[30]  Yifei Lu,et al.  Max Margin AND/OR Graph learning for parsing the human body , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Ben Taskar,et al.  Max-Margin Parsing , 2004, EMNLP.

[32]  Anand Rangarajan,et al.  A new algorithm for non-rigid point matching , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[33]  Daniel Snow,et al.  Efficient optimization of a deformable template using dynamic programming , 1998, Proceedings. 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No.98CB36231).

[34]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[35]  Thomas Hofmann,et al.  Support vector machine learning for interdependent and structured output spaces , 2004, ICML.

[36]  Shimon Ullman,et al.  Class-Specific, Top-Down Segmentation , 2002, ECCV.

[37]  Jiebo Luo,et al.  Body Localization in Still Images Using Hierarchical Models and Hybrid Search , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[38]  Anat Levin,et al.  Learning to Combine Bottom-Up and Top-Down Segmentation , 2006, ECCV.

[39]  Long Zhu,et al.  Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing , 2006, NIPS.

[40]  Andrew Blake,et al.  "GrabCut" , 2004, ACM Trans. Graph..

[41]  John C. Platt Using Analytic QP and Sparseness to Speed Training of Support Vector Machines , 1998, NIPS.

[42]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[43]  David A. McAllester,et al.  A discriminatively trained, multiscale, deformable part model , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[45]  Alan L. Yuille,et al.  A Time-Efficient Cascade for Real-Time Object Detection: With applications for the visually impaired , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Workshops.

[46]  Jitendra Malik,et al.  Recovering human body configurations: combining segmentation and recognition , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[47]  Jitendra Malik,et al.  Shape matching and object recognition using shape contexts , 2010, 2010 3rd International Conference on Computer Science and Information Technology.

[48]  Cordelia Schmid,et al.  Learning to Parse Pictures of People , 2002, ECCV.

[49]  Deva Ramanan,et al.  Learning to parse images of articulated bodies , 2006, NIPS.

[50]  Michael I. Jordan,et al.  Learning with Mixtures of Trees , 2001, J. Mach. Learn. Res..