MonoEye: Multimodal Human Motion Capture System Using A Single Ultra-Wide Fisheye Camera

We present MonoEye, a multimodal human motion capture system that uses a single RGB camera with an ultra-wide fisheye lens, mounted on the user's chest. Existing optical motion capture systems rely on multiple cameras that must be synchronized and calibrated. These systems also impose usability constraints that limit the user's movement and operating space. Because MonoEye is based on a single wearable RGB camera, the wearer's 3D body pose can be captured without restrictions on space or environment. The body pose captured with our system is aware of the camera orientation, which makes it possible to recognize various motions that existing egocentric motion capture systems cannot. Furthermore, the proposed system captures not only the wearer's body motion but also their viewport, using head pose estimation and the ultra-wide image. To implement robust multimodal motion capture, we design three deep neural networks, BodyPoseNet, HeadPoseNet, and CameraPoseNet, which estimate 3D body pose, head pose, and camera pose, respectively, in real time. We train these networks on our new, extensive synthetic dataset, which provides 680K rendered frames of people with a wide range of body shapes, clothing, actions, backgrounds, and lighting conditions. To demonstrate the interactive potential of the MonoEye system, we present several application examples, ranging from common body gestures to context-aware interactions.
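
As a rough illustration of why an orientation-aware body pose helps, the sketch below rotates joints expressed in the chest-camera frame into a world frame. It is not the authors' implementation: the joint names, the y-up axis convention, the rotation about the x axis, and the helper names (`rotation_about_x`, `to_world`) are all assumptions made for this example. The same camera-relative pose yields different world poses once the camera orientation is applied, which is the kind of ambiguity (standing versus lying down) that a purely camera-relative estimate cannot resolve.

```python
import math

def rotation_about_x(angle_rad):
    """Return a 3x3 rotation matrix for a rotation about the x axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[1.0, 0.0, 0.0],
            [0.0, c, -s],
            [0.0, s, c]]

def rotate(R, p):
    """Apply the 3x3 matrix R to the 3D point p."""
    return tuple(sum(R[i][k] * p[k] for k in range(3)) for i in range(3))

def to_world(joints_cam, R_world_from_cam):
    """Rotate every camera-frame joint into the world frame.

    joints_cam maps hypothetical joint names to (x, y, z) positions
    relative to the chest camera; translation is ignored here.
    """
    return {name: rotate(R_world_from_cam, p) for name, p in joints_cam.items()}

# Hypothetical camera-relative pose: head 0.3 m "above" the chest camera
# along the camera's y axis (assumed convention, not from the paper).
joints_cam = {"chest": (0.0, 0.0, 0.0), "head": (0.0, 0.3, 0.0)}

# Identical camera-relative pose, two different camera orientations.
upright = to_world(joints_cam, rotation_about_x(0.0))
lying = to_world(joints_cam, rotation_about_x(math.pi / 2))

# With world y as "up", the head height separates the two cases.
print("upright head height:", round(upright["head"][1], 3))
print("lying head height:", round(lying["head"][1], 3))
```

In this toy setup the head height in the world frame is the only quantity that changes between the two cases, so a gesture recognizer that receives only `joints_cam` would see identical input for both.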
