Pose2Instance: Harnessing Keypoints for Person Instance Segmentation

Human keypoints are a well-studied representation of people. We explore how to use keypoint models to improve instance-level person segmentation. The main idea is to use a distance transform of oracle-provided keypoints or estimated keypoint heatmaps as a prior for the person instance segmentation task within a deep neural network. For training and evaluation, we consider all COCO images for which both instance segmentation and human keypoint annotations are available. We first show how oracle keypoints can boost the performance of an existing human segmentation model during inference without any training. Next, we propose a framework to directly learn a deep instance segmentation model conditioned on human pose. Experimental results show that, at various Intersection over Union (IoU) thresholds, in a constrained setting with oracle keypoints, instance segmentation accuracy improves by 10% to 12% relative to a strong baseline of oracle bounding boxes. In a more realistic setting without oracle keypoints, the proposed deep person instance segmentation model conditioned on human pose achieves 3.8% to 10.5% relative improvements over its strongest baseline, a deep network trained only for segmentation.
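
To make the distance-transform idea concrete, the following is a minimal sketch and not the authors' implementation. It computes, for every pixel, the Euclidean distance to the nearest keypoint by brute force and maps it to a soft prior in (0, 1]. The function name `keypoint_distance_prior`, the exponential decay, and the `sigma` parameter are illustrative assumptions; the abstract does not specify how the distance map is normalized or fed to the network.

```python
import math


def keypoint_distance_prior(height, width, keypoints, sigma=2.0):
    """Build a height x width prior from keypoints given as (row, col) pairs.

    Each cell holds exp(-d / sigma), where d is the Euclidean distance to the
    nearest keypoint. The exponential mapping and sigma are assumptions made
    for illustration only.
    """
    if not keypoints:
        # No keypoints: return an all-zero (uninformative) prior.
        return [[0.0] * width for _ in range(height)]

    prior = []
    for r in range(height):
        row = []
        for c in range(width):
            # Brute-force distance transform: nearest keypoint to pixel (r, c).
            d = min(math.hypot(r - kr, c - kc) for kr, kc in keypoints)
            row.append(math.exp(-d / sigma))
        prior.append(row)
    return prior


# Hypothetical example: one keypoint at the centre of a 5x5 grid.
prior = keypoint_distance_prior(5, 5, [(2, 2)])
```

The prior peaks at each keypoint and decays with distance, so pixels near the annotated or estimated pose are favored as belonging to the person instance. For real image sizes, a linear-time distance transform would be a more practical choice than this quadratic loop.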
