Dynamic Structured Model Selection

In many cases, the predictive power of structured models for complex vision tasks is limited by a trade-off between the expressiveness and the computational tractability of the model. However, choosing this trade-off statically a priori is suboptimal, as images and videos in different settings vary tremendously in complexity. On the other hand, choosing the trade-off dynamically requires knowledge about the accuracy of different structured models on any given example. In this work, we propose a novel two-tier architecture that provides dynamic speed/accuracy trade-offs through a simple type of introspection. Our approach, which we call dynamic structured model selection (DMS), leverages typically intractable features in structured learning problems in order to automatically determine which of several models should be used at test-time in order to maximize accuracy under a fixed budgetary constraint. We demonstrate DMS on two sequential modeling vision tasks, and we establish a new state-of-the-art in human pose estimation in video with an implementation that is roughly 23× faster than the previous standard implementation.
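The idea of choosing a model per example under a fixed budget can be illustrated with a small sketch. This is not the authors' algorithm: it assumes that some first-tier selector has already produced a predicted accuracy for each model on each example, and it uses a simple greedy heuristic (best predicted gain per unit of extra cost) to spend the budget. The function name `select_models` and the inputs `pred_acc`, `costs`, and `budget` are hypothetical and introduced only for illustration.

```python
from typing import List, Sequence


def select_models(
    pred_acc: Sequence[Sequence[float]],
    costs: Sequence[float],
    budget: float,
) -> List[int]:
    """Greedy per-example model selection under a total cost budget.

    pred_acc[i][m] is an ASSUMED predicted accuracy of model m on example i
    (e.g. the output of a learned selector); costs[m] is the cost of running
    model m, with model 0 taken to be the cheapest baseline.
    Returns the chosen model index for each example.
    """
    n = len(pred_acc)
    choice = [0] * n                 # start every example on the cheapest model
    spent = costs[0] * n
    if spent > budget:
        raise ValueError("budget cannot cover the cheapest model on all examples")

    while True:
        best = None                  # (gain per extra cost, example, model, extra cost)
        for i in range(n):
            cur = choice[i]
            for m in range(len(costs)):
                extra = costs[m] - costs[cur]
                gain = pred_acc[i][m] - pred_acc[i][cur]
                # Skip moves that cost nothing extra, do not help, or exceed the budget.
                if extra <= 0 or gain <= 0 or spent + extra > budget:
                    continue
                ratio = gain / extra
                if best is None or ratio > best[0]:
                    best = (ratio, i, m, extra)
        if best is None:             # no affordable improving upgrade remains
            break
        _, i, m, extra = best
        choice[i] = m
        spent += extra
    return choice


if __name__ == "__main__":
    # Three examples, two models: a cheap one (cost 1) and an expensive one (cost 5).
    acc = [[0.90, 0.92], [0.40, 0.80], [0.60, 0.70]]
    print(select_models(acc, costs=[1.0, 5.0], budget=8.0))
```

In this toy setting the budget allows only one upgrade, so the expensive model is spent on the example where the predicted gain is largest. A greedy ratio rule is only one possible allocation strategy and does not guarantee an optimal assignment in general.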
