Robust 3D Hand Pose Estimation From Single Depth Images Using Multi-View CNNs

Articulated hand pose estimation is one of the core technologies in human–computer interaction. Despite recent progress, most existing methods still cannot achieve satisfactory performance, partly because of the difficulty of the underlying high-dimensional nonlinear regression problem. Most existing data-driven methods regress the 3D hand pose directly from a 2D depth image, which cannot fully utilize the depth information. In this paper, we propose a novel multi-view convolutional neural network (CNN)-based approach for 3D hand pose estimation. To better exploit the 3D information in the depth image, we project the point cloud generated from the query depth image onto multiple views under two projection settings and integrate them for more robust estimation. Multi-view CNNs are trained to learn the mapping from projected images to heat-maps, which reflect the probability distributions of joints on each view. These multi-view heat-maps are then fused with learned pose priors to estimate the optimal 3D hand pose, and unreliable information in the multi-view heat-maps is suppressed using a view selection method. Experimental results show that the proposed method outperforms state-of-the-art methods on two challenging data sets. Furthermore, a cross-data-set experiment validates that the proposed approach generalizes well.
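
The projection of a point cloud onto multiple views can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `project_to_views`, the choice of three axis-aligned orthogonal planes, the image size, and the depth normalization are all assumptions made here for illustration only.

```python
import numpy as np


def project_to_views(points, size=32):
    """Project an (N, 3) point cloud onto three orthogonal planes.

    Hypothetical helper: uses the axis-aligned bounding box of the cloud
    (an assumption) and returns one size x size image per view. Each pixel
    stores the normalized distance from the projection plane to the nearest
    point that falls into it; pixels with no point keep the background
    value 1.0.
    """
    points = np.asarray(points, dtype=np.float64)
    lo = points.min(axis=0)
    extent = points.max(axis=0) - lo
    extent[extent == 0] = 1.0  # avoid division by zero for flat clouds
    unit = (points - lo) / extent  # coordinates scaled to [0, 1]

    # (column axis, row axis, depth axis) for each assumed view
    axes = {"xy": (0, 1, 2), "yz": (1, 2, 0), "zx": (2, 0, 1)}
    views = {}
    for name, (u, v, d) in axes.items():
        cols = np.minimum((unit[:, u] * size).astype(int), size - 1)
        rows = np.minimum((unit[:, v] * size).astype(int), size - 1)
        image = np.ones((size, size))
        # Keep the smallest depth per pixel, i.e. the point nearest the plane
        np.minimum.at(image, (rows, cols), unit[:, d])
        views[name] = image
    return views


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cloud = rng.normal(size=(500, 3))  # synthetic stand-in for a hand point cloud
    for name, image in project_to_views(cloud).items():
        print(name, image.shape)
```

In this sketch each projected image would serve as the input to one per-view CNN; the heat-map regression, fusion with pose priors, and view selection described in the abstract are not reproduced here.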
