Hand tracking from monocular RGB with dense semantic labels

Recent years have seen renewed interest in RGB-based hand tracking, as opposed to the depth-based tracking that has dominated the field since the introduction of commodity depth cameras. This trend has been driven by the ability of convolutional neural networks to learn from large quantities of image data. In this paper, we propose an approach to hand tracking that operates on dense semantic labels, and we present a full pipeline for RGB-based hand tracking. The pipeline first uses convolutional neural networks to produce a per-pixel semantic map of the scene, then optimises the state of a kinematic hand model against this semantic map using a tracking algorithm based on Differential Evolution (DE). This technique simultaneously localises the hand in 3D space and recovers its pose, and it requires only monocular RGB input. We apply our technique to a benchmark dataset and report semantic segmentation and 3D pose tracking results, which we compare to the current state of the art. We also compare our DE-based algorithm to an equivalent one based on Particle Swarm Optimisation (PSO) and show that the DE-based algorithm outperforms it.
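The optimisation step can be illustrated with a minimal sketch of Differential Evolution in its classic DE/rand/1/bin form. This is not the tracking algorithm of the paper: the objective here is a hypothetical stand-in (squared distance to a hidden state vector) for the discrepancy between a rendered kinematic model and the semantic map, and the function names, population size, scale factor `f`, crossover rate `cr`, and bounds are all assumptions chosen for illustration.

```python
import numpy as np


def differential_evolution(objective, bounds, pop_size=30, f=0.7, cr=0.9,
                           generations=200, seed=0):
    """Minimise `objective` with the classic DE/rand/1/bin scheme."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(bounds, dtype=float).T
    dim = lower.size
    # Initialise the population uniformly inside the search bounds.
    pop = rng.uniform(lower, upper, size=(pop_size, dim))
    cost = np.array([objective(x) for x in pop])
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, none of them the target i.
            candidates = [j for j in range(pop_size) if j != i]
            a, b, c = pop[rng.choice(candidates, size=3, replace=False)]
            mutant = np.clip(a + f * (b - c), lower, upper)
            # Binomial crossover; force at least one gene from the mutant.
            mask = rng.random(dim) < cr
            mask[rng.integers(dim)] = True
            trial = np.where(mask, mutant, pop[i])
            trial_cost = objective(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_cost <= cost[i]:
                pop[i], cost[i] = trial, trial_cost
    best = int(np.argmin(cost))
    return pop[best], cost[best]


# Hypothetical stand-in for the semantic-map discrepancy: squared distance
# between a candidate state vector and a hidden "true" state.
true_state = np.array([0.3, -0.5, 1.2, 0.0])


def toy_objective(state):
    return float(np.sum((state - true_state) ** 2))


best_state, best_cost = differential_evolution(toy_objective, [(-2.0, 2.0)] * 4)
```

In a tracker of the kind described above, the state vector would hold the global position, orientation, and joint angles of the kinematic model, and the objective would score a rendering of that state against the per-pixel labels. A population-based, derivative-free search such as DE suits that setting because a rendering-based score is generally not differentiable in closed form.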
