Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation under Hand-Object Interaction

We study how well different types of approaches generalise in the task of 3D hand pose estimation under single hand scenarios and hand-object interaction. We show that the accuracy of state-of-the-art methods can drop, and that they fail mostly on poses absent from the training set. Unfortunately, since the space of hand poses is highly dimensional, it is inherently not feasible to cover the whole space densely, despite recent efforts in collecting large-scale training datasets. This sampling problem is even more severe when hands are interacting with objects and/or inputs are RGB rather than depth images, as RGB images also vary with lighting conditions and colors. To address these issues, we designed a public challenge (HANDS'19) to evaluate the abilities of current 3D hand pose estimators (HPEs) to interpolate and extrapolate the poses of a training set. More exactly, HANDS'19 is designed (a) to evaluate the influence of both depth and color modalities on 3D hand pose estimation, under the presence or absence of objects; (b) to assess the generalisation abilities w.r.t. four main axes: shapes, articulations, viewpoints, and objects; (c) to explore the use of a synthetic hand model to fill the gaps of current datasets. Through the challenge, the overall accuracy has dramatically improved over the baseline, especially on extrapolation tasks, from 27mm to 13mm mean joint error. Our analyses highlight the impacts of: Data pre-processing, ensemble approaches, the use of a parametric 3D hand model (MANO), and different HPE methods/backbones.

[1]  Pavlo Molchanov,et al.  Hand Pose Estimation via Latent 2.5D Heatmap Regression , 2018, ECCV.

[2]  Shanxin Yuan,et al.  The 2017 Hands in the Million Challenge on 3D Hand Pose Estimation , 2017, ArXiv.

[3]  Philip H. S. Torr,et al.  3D Hand Shape and Pose From Images in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Cordelia Schmid,et al.  LCR-Net++: Multi-Person 2D and 3D Pose Detection in Natural Images , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5]  Charless C. Fowlkes,et al.  Multi-layer Depth and Epipolar Feature Transformers for 3D Scene Reconstruction , 2019, CVPR Workshops.

[6]  Yichen Wei,et al.  Model-Based Deep Hand Pose Estimation , 2016, IJCAI.

[7]  Qiang Li,et al.  End-to-End Hand Mesh Recovery From a Monocular RGB Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  Christian Theobalt,et al.  Real-Time Hand Tracking Under Occlusion from an Egocentric RGB-D Sensor , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[9]  Qi Qi,et al.  AWR: Adaptive Weighting Regression for 3D Hand Pose Estimation , 2020, AAAI.

[10]  Riccardo Poli,et al.  Particle swarm optimization , 1995, Swarm Intelligence.

[11]  Ying Wu,et al.  Modeling the constraints of human hand motion , 2000, Proceedings Workshop on Human Motion.

[12]  Yichen Wei,et al.  Integral Human Pose Regression , 2017, ECCV.

[13]  Dimitrios Tzionas,et al.  Embodied hands , 2017, ACM Trans. Graph..

[14]  Sergio Escalera,et al.  Depth-Based 3D Hand Pose Estimation: From Current Achievements to Future Goals , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[15]  Tae-Kyun Kim,et al.  Sampling Strategies for GAN Synthetic Data , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Qi Ye,et al.  BigHand2.2M Benchmark: Hand Pose Dataset and State of the Art Analysis , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Daniel Thalmann,et al.  Robust 3D Hand Pose Estimation in Single Depth Images: From Single-View CNN to Multi-View CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Dongheui Lee,et al.  Point-To-Pose Voting Based Hand Pose Estimation Using Residual Permutation Equivariant Layer , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Pavlo Molchanov,et al.  Weakly Supervised 3D Hand Pose Estimation via Biomechanical Constraints , 2020, ECCV.

[21]  Enhua Wu,et al.  Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Alan L. Yuille,et al.  Patch-based 3D Human Pose Refinement , 2019, CVPR Workshops.

[23]  Karthik Ramani,et al.  DeepHand: Robust Hand Pose Estimation by Completing a Matrix Imputed with Deep Features , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Yang Xiao,et al.  A2J: Anchor-to-Joint Regression Network for 3D Articulated Pose Estimation From a Single Depth Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[25]  Zhaohui Zhang,et al.  HandAugment: A Simple Data Augmentation Method for Depth-Based 3D Hand Pose Estimation , 2020, ArXiv.

[26]  Angela Yao,et al.  Aligning Latent Spaces for 3D Hand Pose Estimation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Linda G. Shapiro,et al.  ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation , 2018, ECCV.

[28]  Dong Liu,et al.  Deep High-Resolution Representation Learning for Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Tae-Kyun Kim,et al.  Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Quoc V. Le,et al.  EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks , 2019, ICML.

[32]  Jitendra Malik,et al.  Learning Category-Specific Mesh Reconstruction from Image Collections , 2018, ECCV.

[33]  Josef Kittler,et al.  Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Quoc V. Le,et al.  AutoAugment: Learning Augmentation Strategies From Data , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Takeo Kanade,et al.  Panoptic Studio: A Massively Multiview System for Social Motion Capture , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[36]  Antonis A. Argyros,et al.  Efficient model-based 3D tracking of hand articulations using Kinect , 2011, BMVC.

[37]  Michael J. Black,et al.  Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[38]  Jianfei Cai,et al.  3D Hand Shape and Pose Estimation from a Single RGB Image (Supplementary Material) , 2019 .

[39]  Thomas Brox,et al.  Learning to Estimate 3D Hand Pose from Single RGB Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[40]  Ken Perlin,et al.  Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks , 2014, ACM Trans. Graph..

[41]  Pengfei Ren,et al.  SRN: Stacked Regression Network for Real-time 3D Hand Pose Estimation , 2019, BMVC.

[42]  Dieter Fox,et al.  PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes , 2017, Robotics: Science and Systems.

[43]  Shanxin Yuan,et al.  First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[44]  Dong Tian,et al.  FoldingNet: Point Cloud Auto-Encoder via Deep Grid Deformation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Vincent Lepetit,et al.  DeepPrior++: Improving Fast and Accurate 3D Hand Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[46]  Kyoung Mu Lee,et al.  V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47]  Mahdi Rad,et al.  HO-3D: A Multi-User, Multi-Object Dataset for Joint 3D Hand-Object Pose Estimation , 2019, ArXiv.

[48]  Ersin Yumer,et al.  Self-supervised Learning of Motion Capture , 2017, NIPS.

[49]  Christian Theobalt,et al.  GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  Yaser Sheikh,et al.  Hand Keypoint Detection in Single Images Using Multiview Bootstrapping , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).