Structural SVM for visual localization and continuous state estimation

We present an integrated model for visual object localization and continuous state estimation in a discriminative structured prediction framework. While existing discriminative ‘prediction through time’ methods have shown remarkable versatility for visual reconstruction and tracking problems, they tend to assume that the input is known (or the object is segmented), a condition that can rarely be accommodated in images of real scenes. Our structural Support Vector Machine (structSVM) approach offers an end-to-end training and inference framework that overcomes these limitations by consistently searching both in the space of possible inputs (effectively an efficient form of object localization) and in the space of possible structured outputs, given those inputs. We demonstrate the potential of this methodology for 3D human pose reconstruction in monocular images, both quantitatively on the HumanEva benchmark, where 3D ground truth is available, and qualitatively in un-instrumented images of real scenes.
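The joint search over inputs and outputs described above can be illustrated with a minimal sketch. This is not the method of the paper: it assumes a linear score over a hypothetical joint feature map (here an outer product of window features and pose coordinates) and replaces the efficient search with exhaustive enumeration over a handful of candidate windows and poses. All names (`joint_feature`, `joint_inference`, the toy windows and poses) are illustrative assumptions.

```python
from itertools import product


def joint_feature(window_feat, pose):
    # Hypothetical joint feature map: flattened outer product of the
    # window (input) features and the pose (output) coordinates.
    return [f * p for f in window_feat for p in pose]


def score(w, window_feat, pose):
    # Linear compatibility score between a candidate window and a pose.
    return sum(wi * fi for wi, fi in zip(w, joint_feature(window_feat, pose)))


def joint_inference(w, windows, poses):
    """Return the (window_id, pose) pair with the highest score.

    Exhaustive search over all pairs; a stand-in for the efficient
    search over inputs and outputs mentioned in the abstract.
    """
    best = max(
        product(windows.items(), poses),
        key=lambda pair: score(w, pair[0][1], pair[1]),
    )
    (window_id, _), pose = best
    return window_id, pose


# Toy data (assumed): two candidate windows and two candidate poses.
windows = {"box_a": [1.0, 0.0], "box_b": [0.0, 1.0]}
poses = [(1.0, 0.0), (0.0, 1.0)]
w = [0.1, 0.2, 0.9, 0.3]

result = joint_inference(w, windows, poses)
print(result)
```

In this toy setting the localization (which window) and the state estimate (which pose) are selected together by a single maximization, rather than fixing the window first and predicting the pose afterwards.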
