Fusing Visual and Inertial Sensors with Semantics for 3D Human Pose Estimation

We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely, we use a multi-channel 3D convolutional neural network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull. The learnt pose stream is processed concurrently with a forward kinematic solve of the IMU data, and a temporal model (LSTM) exploits the rich spatial and temporal long-range dependencies among the solved joints. The two streams are then fused in a final fully connected layer. The two complementary data sources allow ambiguities within each sensor modality to be resolved, yielding improved accuracy over prior methods. Extensive evaluation is performed, with state-of-the-art performance reported on the popular Human 3.6M dataset (Ionescu et al. in IEEE Trans Pattern Anal Mach Intell 36(7):1325–1339, 2014), the newly released TotalCapture dataset and a challenging set of outdoor videos, TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture), comprising multi-viewpoint video, IMU data and accurate 3D skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.
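
The two-stream idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (forward_kinematics, fuse), the skeleton layout, and the fusion weights are assumptions made for illustration, and the learnt 3D CNN and LSTM stages are omitted. The first function performs a forward kinematic solve that turns per-bone global orientations (as a calibrated IMU might supply) into joint positions; the second combines a video-derived and an IMU-derived pose estimate with a single fully connected layer.

```python
import numpy as np


def forward_kinematics(parents, offsets, global_rots, root_pos=None):
    """Joint positions from per-bone global orientations (hypothetical helper).

    parents     : parents[j] is the parent index of joint j; joint 0 is the
                  root (parent -1) and every parent precedes its children.
    offsets     : (J, 3) rest-pose offset of each joint from its parent,
                  expressed in the parent's bone frame.
    global_rots : (J, 3, 3) global rotation of each joint's bone frame,
                  assumed here to come from calibrated IMU orientations.
    """
    n_joints = len(parents)
    positions = np.zeros((n_joints, 3))
    if root_pos is not None:
        positions[0] = root_pos
    for j in range(1, n_joints):
        p = parents[j]
        # A child sits at its parent's position plus the rotated bone offset.
        positions[j] = positions[p] + global_rots[p] @ offsets[j]
    return positions


def fuse(pose_video, pose_imu, weights, bias):
    """Fuse two (J, 3) pose estimates with one fully connected layer.

    weights has shape (3J, 6J) and bias has shape (3J,). In a trained
    network both would be learnt; here they are supplied by the caller.
    """
    x = np.concatenate([pose_video.ravel(), pose_imu.ravel()])
    return (weights @ x + bias).reshape(pose_video.shape)


# Toy three-joint chain: root -> joint 1 -> joint 2, each bone 1 unit along y.
parents = [-1, 0, 1]
offsets = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])

# Rotate joint 1's bone frame by 90 degrees about the z axis.
rot_z90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
rots = np.stack([np.eye(3), rot_z90, np.eye(3)])

imu_pose = forward_kinematics(parents, offsets, rots)
video_pose = imu_pose + 0.1  # stand-in for the video stream's estimate

# Fixed averaging weights stand in for learnt fusion parameters.
W = np.hstack([0.5 * np.eye(9), 0.5 * np.eye(9)])
b = np.zeros(9)
fused_pose = fuse(video_pose, imu_pose, W, b)
```

With these averaging weights the fused pose is simply the mean of the two estimates. A learnt layer could instead weight each joint and modality differently, which is one way the complementary sources might resolve each other's ambiguities.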
