HOnnotate: A Method for 3D Annotation of Hand and Object Poses

We propose a method for annotating images of a hand manipulating an object with the 3D poses of both the hand and the object, together with a dataset created using this method. Our motivation is the current lack of annotated real images for this problem: estimating the 3D poses is challenging, mostly because of the mutual occlusions between the hand and the object. To tackle this challenge, we capture sequences with one or several RGB-D cameras and jointly optimize the 3D hand and object poses over all frames simultaneously. This method allows us to automatically annotate each frame with accurate estimates of the poses, despite large mutual occlusions. With this method, we created HO-3D, the first markerless dataset of color images with 3D annotations for both the hand and the object. The dataset currently comprises 77,558 frames from 68 sequences, covering 10 persons and 10 objects. Using our dataset, we develop a method that predicts the hand pose from a single RGB image when the hand interacts with objects under severe occlusions, and we show that it generalizes to objects not seen in the dataset.
