Shape and Pose Estimation for Closely Interacting Persons Using Multi‐view Images

Multi‐person pose and shape estimation is very challenging, especially when the persons have close interactions. Existing methods only work well when people are well spaced out in the captured images. However, close interaction among people is very common in real life, which is more challenge due to complex articulation, frequent occlusion and inherent ambiguities. We present a fully‐automatic markerless motion capture method to simultaneously estimate 3D poses and shapes of closely interacting people from multi‐view sequences. We first predict the 2D joints for each person in an image, and then design a spatio‐temporal tracker for multi‐person pose tracking based on multi‐view videos. Finally, we estimate 3D poses and shapes of all the persons with multi‐view constraints using a skinned multi‐person linear model (SMPL). Experimental results demonstrate that our method achieves fast but accurate pose and shape estimation results for multi‐person close interaction cases. Compared with existing methods, our method does not need pre‐segmentation for each person and manual intervention, which greatly reduces the complexity of the system including time complexity and system processing complexity.

[1]  Pascal Fua,et al.  Articulated Soft Objects for Multiview Shape and Motion Capture , 2003, IEEE Trans. Pattern Anal. Mach. Intell..

[2]  Ian D. Reid,et al.  Articulated Body Motion Capture by Stochastic Search , 2005, International Journal of Computer Vision.

[3]  Hans-Peter Seidel,et al.  Optimization and Filtering for Human Motion Capture , 2010, International Journal of Computer Vision.

[4]  Hans-Peter Seidel,et al.  Performance capture from sparse multi-view video , 2008, ACM Trans. Graph..

[5]  Wojciech Matusik,et al.  Articulated mesh animation from multi-view silhouettes , 2008, ACM Trans. Graph..

[6]  Michael J. Black,et al.  The Naked Truth: Estimating Body Shape Under Clothing , 2008, ECCV.

[7]  Wojciech Matusik,et al.  Articulated mesh animation from multi-view silhouettes , 2008, ACM Trans. Graph..

[8]  Nancy Argüelles,et al.  Author ' s , 2008 .

[9]  Hans-Peter Seidel,et al.  Motion capture using joint skeleton tracking and surface estimation , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Hans-Peter Seidel,et al.  Multilinear pose and body shape estimation of dressed subjects from image sets , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[11]  Slobodan Ilic,et al.  Free-form mesh tracking: A patch-based approach , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12]  Luc Van Gool,et al.  Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities , 2011, NIPS.

[13]  Michael Isard,et al.  Loose-limbed People: Estimating 3D Human Pose and Motion Using Non-parametric Belief Propagation , 2011, International Journal of Computer Vision.

[14]  Hans-Peter Seidel,et al.  Markerless motion capture of interacting characters using multi-view image segmentation , 2011, CVPR 2011.

[15]  Gary R. Bradski,et al.  ORB: An efficient alternative to SIFT or SURF , 2011, 2011 International Conference on Computer Vision.

[16]  Kun Li,et al.  Markerless Shape and Motion Capture From Multiview Video Sequences , 2011, IEEE Transactions on Circuits and Systems for Video Technology.

[17]  Qionghai Dai,et al.  Performance Capture of Interacting Characters with Handheld Kinects , 2012, ECCV.

[18]  T. Kanade,et al.  Reconstructing 3D Human Pose from 2D Image Landmarks , 2012, ECCV.

[19]  Christian Theobalt,et al.  Full Body Performance Capture under Uncontrolled and Varying Illumination: A Shading-Based Approach , 2012, ECCV.

[20]  Christian Theobalt,et al.  On-set performance capture of multiple actors with a stereo camera , 2013, ACM Trans. Graph..

[21]  Cordelia Schmid,et al.  DeepFlow: Large Displacement Optical Flow with Deep Matching , 2013, 2013 IEEE International Conference on Computer Vision.

[22]  Bernt Schiele,et al.  Multi-view Pictorial Structures for 3D Human Pose Estimation , 2013, BMVC.

[23]  Hans-Peter Seidel,et al.  Markerless Motion Capture of Multiple Characters Using Multiview Image Segmentation , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[24]  Cristian Sminchisescu,et al.  Iterated Second-Order Label Sensitive Pooling for 3D Human Pose Estimation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[26]  Michael J. Black,et al.  OpenDR: An Approximate Differentiable Renderer , 2014, ECCV.

[27]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Silvio Savarese,et al.  Learning to Track: Online Multi-object Tracking by Decision Making , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[29]  Michael J. Black,et al.  SMPL: A Skinned Multi-Person Linear Model , 2023 .

[30]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Antoni B. Chan,et al.  Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation , 2015, ICCV.

[32]  James M. Rehg,et al.  Multiple Hypothesis Tracking Revisited , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[33]  Jean-Yves Guillemaut,et al.  General Dynamic Scene Reconstruction from Multiple View Video , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[34]  Antoni B. Chan,et al.  Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[35]  Michael J. Black,et al.  SMPL , 2015 .

[36]  Wongun Choi,et al.  Near-Online Multi-target Tracking with Aggregated Local Flow Descriptor , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[37]  Mohan S. Kankanhalli,et al.  Marker-Less 3D Human Motion Capture with Monocular Image Sequence and Height-Maps , 2016, ECCV.

[38]  Xiaowei Zhou,et al.  Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Peter V. Gehler,et al.  Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[41]  Adrian Hilton,et al.  Deep Convolutional Networks for Marker-less Human Pose Estimation from Multiple Views , 2016, CVMP 2016.

[42]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Christian Theobalt,et al.  Monocular 3D Human Pose Estimation Using Transfer Learning and Improved CNN Supervision , 2016, ArXiv.

[44]  Vincent Lepetit,et al.  Structured Prediction of 3D Human Pose with Deep Neural Networks , 2016, BMVC.

[45]  Bernt Schiele,et al.  DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model , 2016, ECCV.

[46]  Fabio Tozeto Ramos,et al.  Simple online and realtime tracking , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[47]  Bernt Schiele,et al.  Multi-person Tracking by Multicut and Deep Matching , 2016, ECCV Workshops.

[48]  Peter V. Gehler,et al.  DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Wei Zhang,et al.  Deep Kinematic Pose Regression , 2016, ECCV Workshops.

[50]  Hans-Peter Seidel,et al.  General Automatic Human Shape and Motion Capture Using Volumetric Contour Cues , 2016, ECCV.

[51]  Vincent Lepetit,et al.  Direct Prediction of 3D Body Poses from Motion Compensated Sequences , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Yinghao Huang,et al.  Towards Accurate Marker-Less Human Shape and Pose Estimation over Time , 2017, 2017 International Conference on 3D Vision (3DV).

[53]  Zhiao Huang,et al.  Associative Embedding: End-to-End Learning for Joint Detection and Grouping , 2016, NIPS.

[54]  Xiaogang Wang,et al.  Learning Feature Pyramids for Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[55]  Dacheng Tao,et al.  A Coarse-Fine Network for Keypoint Localization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[56]  Jonathan Tompson,et al.  Towards Accurate Multi-person Pose Estimation in the Wild , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Francesc Moreno-Noguer,et al.  3D Human Pose Estimation from a Single Image via Distance Matrix Regression , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Pascal Fua,et al.  Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[59]  Cristian Sminchisescu,et al.  Deep Multitask Architecture for Integrated 2D and 3D Human Sensing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Bing Wang,et al.  Tracklet Association by Online Target-Specific Metric Learning and Coherent Dynamics Estimation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[61]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[62]  Xiaowei Zhou,et al.  Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Cewu Lu,et al.  RMPE: Regional Multi-person Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[64]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[65]  Gang Yu,et al.  Cascaded Pyramid Network for Multi-person Pose Estimation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[66]  Qionghai Dai,et al.  Outdoor Markerless Motion Capture with Sparse Handheld Video Cameras , 2018, IEEE Transactions on Visualization and Computer Graphics.

[67]  Pascal Fua,et al.  Learning Monocular 3D Human Pose Estimation from Multi-view Images , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[68]  Christian Theobalt,et al.  MonoPerfCap , 2017, ACM Trans. Graph..

[69]  Haoyu Wang,et al.  Pose Flow: Efficient Online Pose Tracking , 2018, BMVC.

[70]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[71]  Taku Komura,et al.  A Sampling Approach to Generating Closely Interacting 3D Pose-Pairs from 2D Annotations , 2019, IEEE Transactions on Visualization and Computer Graphics.

[72]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.