Consistent 3D Hand Reconstruction in Video via Self-Supervised Learning

We present a method for reconstructing accurate and consistent 3D hands from a monocular video. We observe that detected 2D hand keypoints and the image texture provide important cues about the geometry and texture of the 3D hand, which can reduce or even eliminate the need for 3D hand annotation. We therefore propose S2HAND, a self-supervised 3D hand reconstruction model that jointly estimates pose, shape, texture, and camera viewpoint from a single RGB input, supervised by easily accessible detected 2D keypoints. We further leverage the continuous hand motion information contained in unlabeled video data and propose S2HAND(V), which uses a set of weight-shared S2HAND models to process each frame and exploits additional motion, texture, and shape consistency constraints to promote more accurate hand poses and more consistent shapes and textures. Experiments on benchmark datasets demonstrate that our self-supervised approach achieves hand reconstruction performance comparable to recent fully supervised methods in the single-frame input setting, and notably improves reconstruction accuracy and consistency when trained on video data.
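The abstract does not spell out the loss terms, so the following is only a minimal sketch of the two kinds of signal it mentions: supervision from detected 2D keypoints, and a consistency constraint across the frames of a clip. The function names `keypoint_loss` and `consistency_loss`, the confidence weighting, and the "deviation from the clip mean" form are all assumptions made for illustration, not the formulation used by S2HAND or S2HAND(V).

```python
from typing import Sequence, Tuple

Point2D = Tuple[float, float]


def keypoint_loss(
    projected: Sequence[Point2D],
    detected: Sequence[Point2D],
    confidence: Sequence[float],
) -> float:
    """Confidence-weighted mean squared distance between projected 3D joints
    and detected 2D keypoints (hypothetical form, for illustration only)."""
    total = 0.0
    weight_sum = 0.0
    for (px, py), (dx, dy), c in zip(projected, detected, confidence):
        total += c * ((px - dx) ** 2 + (py - dy) ** 2)
        weight_sum += c
    # With no confident detections there is nothing to supervise.
    return total / weight_sum if weight_sum > 0 else 0.0


def consistency_loss(per_frame_params: Sequence[Sequence[float]]) -> float:
    """Mean squared deviation of each frame's parameter vector (for example
    shape or texture coefficients) from the mean over the clip. This is an
    assumed stand-in for a cross-frame consistency constraint."""
    n = len(per_frame_params)
    dim = len(per_frame_params[0])
    mean = [sum(frame[i] for frame in per_frame_params) / n for i in range(dim)]
    squared = sum(
        (frame[i] - mean[i]) ** 2 for frame in per_frame_params for i in range(dim)
    )
    return squared / (n * dim)


if __name__ == "__main__":
    # Two joints; the second projection is one pixel off in y.
    print(keypoint_loss([(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, 2.0)], [1.0, 1.0]))
    # Identical shape vectors in every frame give zero consistency loss.
    print(consistency_loss([[1.0, 2.0], [1.0, 2.0]]))
```

In this sketch the keypoint term needs no 3D annotation, only detector output, while the consistency term is zero when every frame predicts the same parameters and grows as per-frame predictions drift apart.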
