Humanising GrabCut: Learning to segment humans using the Kinect

The Kinect provides an opportunity to collect large quantities of training data for visual learning algorithms relatively effortlessly. To this end we investigate learning to automatically segment humans from cluttered images (without depth information) given a bounding box. For this algorithm, obtaining a large dataset of images with segmented humans is crucial, as it enables the possible variations in human appearance and background to be learnt. We show that a large dataset of roughly 3400 humans can be acquired automatically and very cheaply using the Kinect. Segmenting humans is then cast as a learning problem, with linear classifiers trained to predict segmentation masks from sparsely coded local HOG descriptors. These classifiers introduce top-down knowledge to obtain a crude segmentation of the human, which is then refined using bottom-up information from local color models in a SnapCut-like fashion [21]. The method is quantitatively evaluated on images of humans in cluttered scenes, and a high performance is obtained (88.5% overlap score). We also show that the method can be completely automated, segmenting humans given only the images without requiring a bounding box, and compare it with a previous state-of-the-art method.
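The two-stage pipeline described above (sparsely coded local HOG descriptors feeding linear classifiers for a crude top-down mask, followed by bottom-up refinement with local color models) can be illustrated with a short sketch. This is not the authors' implementation: the sparse-coding dictionary, the per-patch classifiers, and the patch/cell sizes are assumptions standing in for components the paper learns offline from the Kinect-acquired masks, and OpenCV's grabCut is used as a stand-in for the SnapCut-style localized color models [21].

```python
# Minimal sketch under the assumptions stated above; not the authors' code.
import numpy as np
import cv2
from skimage.feature import hog
from sklearn.decomposition import sparse_encode

PATCH = 32   # assumed side length (pixels) of each local patch
CELL = 8     # assumed HOG cell size inside a patch


def coarse_human_mask(person_crop, dictionary, classifiers):
    """Top-down step: predict a crude foreground mask inside a bounding box.

    person_crop : HxWx3 uint8 BGR image cropped to the detection box.
    dictionary  : (n_atoms, n_hog_dims) sparse-coding dictionary (hypothetical,
                  learnt offline from the Kinect training data).
    classifiers : list of sklearn linear classifiers, one per patch of the crop
                  (row-major order), each predicting {0: background, 1: person}.
    """
    gray = cv2.cvtColor(person_crop, cv2.COLOR_BGR2GRAY)
    rows, cols = gray.shape[0] // PATCH, gray.shape[1] // PATCH
    coarse = np.zeros((rows, cols), np.uint8)
    for r in range(rows):
        for c in range(cols):
            patch = gray[r * PATCH:(r + 1) * PATCH, c * PATCH:(c + 1) * PATCH]
            desc = hog(patch, orientations=9, pixels_per_cell=(CELL, CELL),
                       cells_per_block=(2, 2), feature_vector=True)
            # Sparse-code the local HOG descriptor against the dictionary.
            code = sparse_encode(desc[None, :], dictionary,
                                 algorithm='omp', n_nonzero_coefs=5)
            coarse[r, c] = classifiers[r * cols + c].predict(code)[0]
    # Upsample the patch-level prediction back to pixel resolution.
    return cv2.resize(coarse, (gray.shape[1], gray.shape[0]),
                      interpolation=cv2.INTER_NEAREST)


def refine_mask(person_crop, coarse_mask, n_iters=5):
    """Bottom-up step: refine the crude mask with color models (GrabCut here)."""
    gc_mask = np.where(coarse_mask > 0, cv2.GC_PR_FGD,
                       cv2.GC_PR_BGD).astype(np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(person_crop, gc_mask, None, bgd, fgd, n_iters,
                cv2.GC_INIT_WITH_MASK)
    return np.isin(gc_mask, (cv2.GC_FGD, cv2.GC_PR_FGD)).astype(np.uint8)
```

In this reading of the method, the top-down classifiers only need to get the coarse shape of the person right, while the color-model refinement recovers the boundary detail that the patch grid cannot express; the paper's localized color models would replace the global GrabCut call used here.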

[1] David A. McAllester, et al. Object Detection with Discriminatively Trained Part Based Models, 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2] Shimon Ullman, et al. Learning to Segment, 2004, ECCV.

[3] Pieter Peers, et al. SubEdit: a representation for editing measured heterogeneous subsurface scattering, 2009, SIGGRAPH.

[4] Michael J. Black, et al. The Naked Truth: Estimating Body Shape Under Clothing, 2008, ECCV.

[5] Chih-Jen Lin, et al. LIBLINEAR: A Library for Large Linear Classification, 2008, J. Mach. Learn. Res.

[6] Andrew Zisserman, et al. OBJCUT: Efficient Segmentation Using Top-Down and Bottom-Up Cues, 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7] Pushmeet Kohli, et al. Associative hierarchical CRFs for object class image segmentation, 2009, IEEE 12th International Conference on Computer Vision.

[8] Jianguo Zhang, et al. The PASCAL Visual Object Classes Challenge, 2006.

[9] Ankur Agarwal, et al. Recovering 3D human pose from monocular images, 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10] Andrew Zisserman, et al. OBJ CUT, 2005, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[11] Anat Levin, et al. Learning to Combine Bottom-Up and Top-Down Segmentation, 2006, International Journal of Computer Vision.

[12] Shimon Ullman, et al. Class-Specific, Top-Down Segmentation, 2002, ECCV.

[13] Guillermo Sapiro, et al. Online dictionary learning for sparse coding, 2009, ICML.

[14] Yoram Singer, et al. Pegasos: primal estimated sub-gradient solver for SVM, 2011, Math. Program.

[15] B. Schiele, et al. Interleaved Object Categorization and Segmentation, 2003, BMVC.

[16] Pushmeet Kohli, et al. PoseCut: Simultaneous Segmentation and 3D Pose Estimation of Humans Using Dynamic Graph-Cuts, 2006, ECCV.

[17] Bill Triggs, et al. Histograms of oriented gradients for human detection, 2005, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[18] Andrew Blake, et al. "GrabCut", 2004, ACM Trans. Graph.

[19] Luc Van Gool, et al. The 2005 PASCAL Visual Object Classes Challenge, 2005, MLCW.

[20] Vladimir Kolmogorov, et al. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision, 2001, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21] Guillermo Sapiro, et al. Video SnapCut: robust video object cutout using localized classifiers, 2009, SIGGRAPH.

[22] Jitendra Malik, et al. Poselets: Body part detectors trained using 3D human pose annotations, 2009, IEEE 12th International Conference on Computer Vision.