3D hand mesh reconstruction from a monocular RGB image

Most existing methods for RGB-based 3D hand analysis focus on estimating hand keypoints or poses, which cannot capture the geometric details of the 3D hand shape. In this work, we propose a novel method to reconstruct a 3D hand mesh from a single monocular RGB image. Unlike current parameter-based or pose-based methods, our method directly estimates the 3D hand mesh with a graph convolutional neural network (GCN). Our network consists of two modules: a hand localization and mask generation module, and a 3D hand mesh reconstruction module. The first module, a VGG16-based network, localizes the hand region in the input image and generates a binary mask of the hand. The second module takes the high-order features from the first and uses a GCN-based network to estimate the coordinates of each vertex of the hand mesh, thereby reconstructing the 3D hand shape. To achieve better accuracy, we propose a novel loss based on the differential properties of the discrete mesh. We also use professional software to create a large synthetic dataset that contains both ground-truth 3D hand meshes and poses for training. To handle real-world data, we use a CycleGAN network to transform real-world images into the domain of our synthetic dataset. We demonstrate that our method produces accurate 3D hand meshes and runs efficiently enough for real-time applications.
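The abstract does not give the exact form of the loss based on the differential properties of the discrete mesh. As a rough illustration of what such a loss can look like, the following sketch compares uniform Laplacian coordinates (each vertex minus the centroid of its one-ring neighbours) between a predicted and a ground-truth mesh that share the same triangle connectivity. The function names, the uniform weighting, and the squared-error form are assumptions made for this example, not the formulation used in the paper.

```python
from typing import Dict, List, Sequence, Set, Tuple

Vec3 = Tuple[float, float, float]
Face = Tuple[int, int, int]


def build_adjacency(num_vertices: int, faces: Sequence[Face]) -> Dict[int, Set[int]]:
    """Collect the one-ring neighbours of every vertex from the triangle faces."""
    neighbours: Dict[int, Set[int]] = {i: set() for i in range(num_vertices)}
    for a, b, c in faces:
        neighbours[a].update((b, c))
        neighbours[b].update((a, c))
        neighbours[c].update((a, b))
    return neighbours


def laplacian_coords(vertices: Sequence[Vec3],
                     neighbours: Dict[int, Set[int]]) -> List[Vec3]:
    """Uniform Laplacian coordinates: vertex minus centroid of its neighbours."""
    result: List[Vec3] = []
    for i, v in enumerate(vertices):
        ring = neighbours[i]
        if not ring:
            # Isolated vertex: no local geometry to describe.
            result.append((0.0, 0.0, 0.0))
            continue
        centroid = [sum(vertices[j][k] for j in ring) / len(ring) for k in range(3)]
        result.append((v[0] - centroid[0], v[1] - centroid[1], v[2] - centroid[2]))
    return result


def laplacian_loss(pred: Sequence[Vec3], target: Sequence[Vec3],
                   faces: Sequence[Face]) -> float:
    """Mean squared difference between predicted and target Laplacian coordinates.

    Hypothetical loss for illustration only; assumes both meshes share `faces`.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of vertices")
    neighbours = build_adjacency(len(pred), faces)
    lap_pred = laplacian_coords(pred, neighbours)
    lap_target = laplacian_coords(target, neighbours)
    total = sum((p[k] - t[k]) ** 2
                for p, t in zip(lap_pred, lap_target)
                for k in range(3))
    return total / len(pred)


# Toy mesh: a tetrahedron standing in for a hand mesh.
TETRA_VERTICES: List[Vec3] = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                              (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
TETRA_FACES: List[Face] = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
```

Because Laplacian coordinates describe each vertex relative to its neighbours, this kind of term penalises differences in local surface shape and is unaffected by a global translation of the mesh, so in practice it would be combined with a per-vertex position loss.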
