Simultaneous Human Segmentation, Depth and Pose Estimation via Dual Decomposition

Stereo matching, segmentation, and human pose estimation have each received considerable attention in computer vision in recent years, but attempts to combine the three tasks have so far required compromises: either relying on infra-red cameras or using a greatly simplified body model. We propose a framework for estimating a detailed 3D human skeleton from a stereo pair of images. Within this framework, we define an energy function that couples the segmentation results, the pose estimation results, and the disparity space image. Specifically, we encode the assumptions that foreground pixels should belong to some body part, should correspond to a continuous surface in the disparity space image, and should be closer to the camera than the surrounding background pixels. Minimizing this energy function is NP-hard; however, we show how to efficiently optimize a relaxation of it using dual decomposition. We show that this approach improves results on all three tasks, and we introduce an extensive and challenging new dataset, which we use as a benchmark for evaluating 3D human pose estimation.
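
To illustrate the general idea of dual decomposition, the following is a minimal sketch on a toy problem, not the energy function or solver used in this work. It assumes a hypothetical binary labelling of a 1D chain of pixels whose energy splits into two terms: a purely unary term and a unary-plus-smoothness term. Each term is minimized by its own exact solver, and Lagrange multipliers are updated by a subgradient step until the two solutions agree. All function names, costs, and the 1/t step size are assumptions made for illustration.

```python
def chain_energy(x, costs, smooth):
    """Unary costs plus a penalty `smooth` for each label change along the chain."""
    unary = sum(costs[i][x[i]] for i in range(len(x)))
    return unary + smooth * sum(x[i] != x[i + 1] for i in range(len(x) - 1))


def solve_unary(costs):
    """Exact minimizer of a purely unary energy: pick the cheaper label per pixel."""
    return [0 if c[0] <= c[1] else 1 for c in costs]


def solve_chain(costs, smooth):
    """Exact minimizer of chain_energy by dynamic programming (Viterbi)."""
    best = [list(costs[0])]
    back = []
    for i in range(1, len(costs)):
        row, ptr = [], []
        for b in (0, 1):
            cands = [best[-1][a] + smooth * (a != b) for a in (0, 1)]
            a = 0 if cands[0] <= cands[1] else 1
            row.append(cands[a] + costs[i][b])
            ptr.append(a)
        best.append(row)
        back.append(ptr)
    x = [0 if best[-1][0] <= best[-1][1] else 1]
    for ptr in reversed(back):
        x.append(ptr[x[-1]])
    return x[::-1]


def dual_decomposition(unary_a, unary_b, smooth, iters=100):
    """Minimize f1(x) + f2(x), with f1 unary and f2 a smoothed chain (toy setup)."""
    n = len(unary_a)
    total = [(a[0] + b[0], a[1] + b[1]) for a, b in zip(unary_a, unary_b)]
    lam = [0.0] * n  # one multiplier per pixel, attached to label 1
    best_x, best_primal, best_lower = None, float("inf"), float("-inf")
    for t in range(1, iters + 1):
        # Subproblem 1 minimizes f1(x) + lam.x; subproblem 2 minimizes f2(x) - lam.x.
        c1 = [(unary_a[i][0], unary_a[i][1] + lam[i]) for i in range(n)]
        c2 = [(unary_b[i][0], unary_b[i][1] - lam[i]) for i in range(n)]
        x1, x2 = solve_unary(c1), solve_chain(c2, smooth)
        # The sum of the two minima is a lower bound on the true optimum.
        lower = chain_energy(x1, c1, 0.0) + chain_energy(x2, c2, smooth)
        best_lower = max(best_lower, lower)
        # Either subproblem solution is a feasible labelling of the full problem.
        for x in (x1, x2):
            e = chain_energy(x, total, smooth)
            if e < best_primal:
                best_x, best_primal = x, e
        if x1 == x2:  # agreement: the bound is tight and x1 is optimal
            break
        step = 1.0 / t  # diminishing step size (an assumed schedule)
        lam = [lam[i] + step * (x1[i] - x2[i]) for i in range(n)]
    return best_x, best_primal, best_lower
```

Because both subproblems in this sketch are solved exactly, the returned lower bound never exceeds the true minimum, and the gap between it and the best labelling found indicates how far the relaxation is from agreement.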
