Multiview Video-Based 3-D Hand Pose Estimation

Hand pose estimation (HPE) can be used for a variety of human-computer interaction applications such as gesture-based control for physical or virtual/augmented reality devices. Recent works have shown that videos or multi-view images carry rich information regarding the hand, allowing for the development of more robust HPE systems. In this paper, we present the Multi-View Video-Based 3D Hand (MuViHand) dataset, consisting of multi-view videos of the hand along with ground-truth 3D pose labels. Our dataset includes more than 402,000 synthetic hand images available in 4,560 videos. The videos have been simultaneously captured from six different angles with complex backgrounds and random levels of dynamic lighting. The data has been captured from 10 distinct animated subjects using 12 cameras in a semi-circle topology where six tracking cameras only focus on the hand and the other six fixed cameras capture the entire body. Next, we implement MuViHandNet, a neural pipeline consisting of image encoders for obtaining visual embeddings of the hand, recurrent learners to learn both temporal and angular sequential information, and graph networks with U-Net architectures to estimate the final 3D pose information. We perform extensive experiments and show the challenging nature of this new dataset as well as the effectiveness of our proposed method. Ablation studies show the added value of each component in MuViHandNet, as well as the benefit of having temporal and sequential information in the dataset. We make our dataset publicly available to contribute to the field at: https://github.com/LeylaKhaleghi/MuViHand.

[1]  Dimitrios Tzionas,et al.  Embodied Hands: Modeling and Capturing Hands and Bodies Together , 2022, ArXiv.

[2]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[3]  Li Zhuo,et al.  3D Hand Pose Estimation with Disentangled Cross-Modal Latent Space , 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[4]  Christian Theobalt,et al.  GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Katerina Fragkiadaki,et al.  Epipolar Transformers , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Yan Fu,et al.  A Survey on Hand Pose Estimation with Wearable Sensors and Computer-Vision-Based Methods , 2020, Sensors.

[7]  Tae-Kyun Kim,et al.  Spatio-Temporal Hough Forest for efficient detection-localisation-recognition of fingerwriting in egocentric camera , 2016, Comput. Vis. Image Underst..

[8]  Petros Daras,et al.  Cross-modal Variational Alignment of Latent Spaces , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[9]  Christian Theobalt,et al.  Real-Time Hand Tracking Under Occlusion from an Egocentric RGB-D Sensor , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[10]  Thomas Funkhouser,et al.  An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds , 2020, ECCV.

[11]  Xilin Chen,et al.  Iterative Reference Driven Metric Learning for Signer Independent Isolated Sign Language Recognition , 2016, ECCV.

[12]  Qi Ye,et al.  Occlusion-aware Hand Pose Estimation Using Hierarchical Mixture Density Network , 2017, ECCV.

[13]  Shuiwang Ji,et al.  Graph U-Nets , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Jianrong Tan,et al.  A survey on 3D hand pose estimation: Cameras, methods, and datasets , 2019, Pattern Recognit..

[15]  Thomas Brox,et al.  FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape From Single RGB Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[16]  Yusheng Xie,et al.  MVHM: A Large-Scale Multi-View Hand Mesh Benchmark for Accurate 3D Hand Pose Estimation , 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[17]  Paulo Lobato Correia,et al.  Multi-Perspective LSTM for Joint Visual Representation Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Albert Dipanda,et al.  Hand pose estimation and tracking in real and virtual interaction: A review , 2019, Image Vis. Comput..

[19]  Cordelia Schmid,et al.  Learning Joint Reconstruction of Hands and Manipulated Objects , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Jianfei Cai,et al.  Weakly-Supervised 3D Hand Pose Estimation from Monocular RGB Images , 2018, ECCV.

[21]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[22]  Jianfei Cai,et al.  3D Hand Shape and Pose Estimation from a Single RGB Image (Supplementary Material) , 2019 .

[23]  Soo-Hyung Kim,et al.  Robust Hand Shape Features for Dynamic Hand Gesture Recognition Using Multi-Level Feature LSTM , 2020, Applied Sciences.

[24]  Philip H. S. Torr,et al.  3D Hand Shape and Pose From Images in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  David J. Crandall,et al.  HOPE-Net: A Graph-Based Model for Hand-Object Pose Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Yaser Sheikh,et al.  Hand Keypoint Detection in Single Images Using Multiview Bootstrapping , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Mingliang Chen,et al.  3D Hand Pose Tracking and Estimation Using Stereo Matching , 2016, ArXiv.

[28]  Shanxin Yuan,et al.  First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Trevor Darrell,et al.  Body2Hands: Learning to Infer 3D Hands from Conversational Gesture Body Dynamics , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Paulo Lobato Correia,et al.  A Deep Framework for Facial Emotion Recognition using Light Field Images , 2019, 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII).

[31]  Yi Liu,et al.  Semi-Supervised Joint Learning for Hand Gesture Recognition from a Single Color Image , 2021, Sensors.

[32]  Shanxin Yuan,et al.  3D Hand Pose Estimation from RGB Using Privileged Learning with Depth Data , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[33]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[34]  Paulo Lobato Correia,et al.  Facial Emotion Recognition Using Light Field Images with Deep Attention-Based Bidirectional LSTM , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[35]  Woontack Woo,et al.  3D Finger CAPE: Clicking Action and Position Estimation under Self-Occlusions in Egocentric Viewpoint , 2015, IEEE Transactions on Visualization and Computer Graphics.

[36]  Antonis A. Argyros,et al.  Using a Single RGB Frame for Real Time 3D Hand Pose Estimation in the Wild , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[37]  Mohammed Bennamoun,et al.  Attention in Convolutional LSTM for Gesture Recognition , 2018, NeurIPS.

[38]  Tobias Höllerer,et al.  Multithreaded Hybrid Feature Tracking for Markerless Augmented Reality , 2009, IEEE Transactions on Visualization and Computer Graphics.

[39]  Qi Ye,et al.  BigHand2.2M Benchmark: Hand Pose Dataset and State of the Art Analysis , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Cewu Lu,et al.  Online Video Object Detection Using Association LSTM , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[41]  Yunhui Liu,et al.  Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation under Hand-Object Interaction , 2020, ECCV.

[42]  Ken Perlin,et al.  Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks , 2014, ACM Trans. Graph..

[43]  Mohan M. Trivedi,et al.  HandyNet: A One-stop Solution to Detect, Segment, Localize & Analyze Driver Hands , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[44]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[45]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Hazem Wannous,et al.  Heterogeneous hand gesture recognition using 3D dynamic skeletal data , 2019, Comput. Vis. Image Underst..

[47]  Tae-Kyun Kim,et al.  Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Paulo Lobato Correia,et al.  Long Short-Term Memory With Gate and State Level Fusion for Light Field-Based Face Recognition , 2021, IEEE Transactions on Information Forensics and Security.

[49]  Thomas Brox,et al.  Learning to Estimate 3D Hand Pose from Single RGB Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[50]  Yi Yang,et al.  Depth-Based Hand Pose Estimation: Methods, Data, and Challenges , 2015, International Journal of Computer Vision.

[51]  Andy Cockburn,et al.  User-defined gestures for augmented reality , 2013, INTERACT.

[52]  Pavlo Molchanov,et al.  Hand Pose Estimation via Latent 2.5D Heatmap Regression , 2018, ECCV.

[53]  Iasonas Kokkinos,et al.  Weakly-Supervised Mesh-Convolutional Hand Reconstruction in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Thomas Wolf,et al.  Monocular RGB Hand Pose Inference from Unsupervised Refinable Nets , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[55]  Angela Yao,et al.  Aligning Latent Spaces for 3D Hand Pose Estimation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[56]  Bardia Doosti,et al.  Hand Pose Estimation: A Survey , 2019, ArXiv.

[57]  Hyung Jin Chang,et al.  SeqHAND: RGB-Sequence-Based 3D Hand Pose and Shape Estimation , 2020, ECCV.

[58]  Nadia Magnenat-Thalmann,et al.  Exploiting Spatial-Temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).