Transformer Inertial Poser: Real-time Human Motion Reconstruction from Sparse IMUs with Simultaneous Terrain Generation

Real-time human motion reconstruction from a sparse set of (e.g., six) wearable IMUs provides a non-intrusive and economical approach to motion capture. Because IMUs cannot measure position directly, recent works have taken data-driven approaches that use large human motion datasets to tackle this under-determined problem. Still, challenges remain, such as temporal consistency, drift in global and joint motions, and diverse coverage of motion types on various terrains. We propose a novel method to simultaneously estimate full-body motion and generate plausible visited terrain from only six IMU sensors in real time. Our method incorporates (1) a conditional Transformer decoder model that gives consistent predictions by explicitly reasoning about prediction history, (2) a simple yet general learning target named "stationary body points" (SBPs), which can be stably predicted by the Transformer model and used by analytical routines to correct joint and global drift, and (3) an algorithm to generate regularized terrain height maps from noisy SBP predictions, which can in turn correct noisy global motion estimation. We evaluate our framework extensively on synthesized and real IMU data, as well as in real-time live demos, and show superior performance over strong baseline methods.
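
As a rough illustration of how a stationary body point might be used to correct global drift, the sketch below assumes that a body point (for example, a foot) is flagged as stationary across two consecutive frames and that its offset from the root is known in world-aligned axes. The function name `correct_root_with_sbp` and the tuple-based representation are hypothetical and chosen only for this example; this is not the routine used in the paper.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def correct_root_with_sbp(prev_root: Vec3, prev_offset: Vec3, cur_offset: Vec3) -> Vec3:
    """Return a root position that keeps a stationary body point fixed.

    Assumption (illustrative only): the world position of a body point is
    root + offset, with the offset expressed in world-aligned axes.
    """
    # World position of the stationary point in the previous frame.
    anchor = tuple(r + o for r, o in zip(prev_root, prev_offset))
    # Solve root + cur_offset = anchor for the current root position.
    return tuple(a - o for a, o in zip(anchor, cur_offset))


# The foot was directly below the root; in the current pose it sits 0.25 m
# behind the root along x, so the root must have moved 0.25 m forward.
new_root = correct_root_with_sbp(
    prev_root=(0.0, 1.0, 0.0),
    prev_offset=(0.0, -1.0, 0.0),
    cur_offset=(-0.25, -1.0, 0.0),
)
print(new_root)
```

Under these assumptions, the root translation follows from the pose change alone, so it does not rely on integrating noisy IMU accelerations.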
