HO-3D: A Multi-User, Multi-Object Dataset for Joint 3D Hand-Object Pose Estimation

We propose a new dataset for 3D hand+object pose estimation from color images, together with a method for efficiently annotating this dataset and a 3D pose prediction method based on this dataset. The current lack of training data makes 3D hand+object pose estimation very challenging. This lack is due to the complexity of labeling many real images with both 3D poses and of generating synthetic images with varied, realistic interactions. Moreover, even if synthetic images could be used for training, annotated real images are still needed for validation. To tackle this challenge, we capture sequences with a simple setup made of a single RGB-D camera. We also use a color camera imaging the sequences from a side view, but only for validation. We introduce a novel method based on global optimization that exploits depth, color, and temporal constraints to annotate the sequences efficiently. We use these annotations to train another novel method that predicts the 3D poses of both the hand and the object from a single color image. Our hope is to encourage other researchers to develop better annotation methods for our dataset: one can then apply such a method to easily annotate sequences captured with a single RGB-D camera and create additional training data, thus solving one of the main problems of 3D hand+object pose estimation.
