Data-driven Recovery of Hand Depth using Conditional Regressive Random Forest on Stereo Images

Hand pose is emerging as an important interface for human-computer interaction. This paper presents a data-driven method for estimating a high-quality depth map of a hand from stereoscopic camera input. It introduces a novel superpixel-based regression framework that takes advantage of the smoothness of the hand's depth surface. To this end, we introduce the Conditional Regressive Random Forest (CRRF), a method that combines a Conditional Random Field (CRF) and a Regressive Random Forest (RRF) to model the mapping from a stereo RGB image pair to a depth image. The RRF provides a unary term that adaptively selects among different stereo-matching measures as it implicitly determines matching pixels in a coarse-to-fine manner. While the RRF predicts depth for each superpixel independently, the CRF unifies these predictions by modeling pairwise interactions between adjacent superpixels. Experimental results show that, using an inexpensive stereo camera, CRRF generates a more accurate depth image than leading contemporary techniques.
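
The split between a per-superpixel unary term and a pairwise smoothness term can be illustrated with a small sketch. It is not the authors' formulation: the abstract does not specify the CRF potentials or the inference procedure, so the quadratic energy E(d) = sum_i (d_i - u_i)^2 + lam * sum_(i,j) w_ij (d_i - d_j)^2, the function name `refine_depths`, and the parameters `lam` and `iters` are all assumptions made for illustration. Here `unary` stands in for the independent RRF depth predictions and `edges` lists the pairs of adjacent superpixels.

```python
def refine_depths(unary, edges, weights, lam=1.0, iters=200):
    """Smooth per-superpixel depths over an adjacency graph (illustrative only).

    Minimizes the assumed quadratic energy
        sum_i (d_i - u_i)^2 + lam * sum_(i,j) w_ij * (d_i - d_j)^2
    with Gauss-Seidel coordinate updates.
    """
    depth = [float(u) for u in unary]

    # Build a neighbour list from the undirected edges.
    neighbours = [[] for _ in depth]
    for (i, j), w in zip(edges, weights):
        neighbours[i].append((j, w))
        neighbours[j].append((i, w))

    for _ in range(iters):
        for i, u in enumerate(unary):
            w_sum = sum(w for _, w in neighbours[i])
            pull = sum(w * depth[j] for j, w in neighbours[i])
            # Setting the derivative of the energy w.r.t. d_i to zero gives
            # d_i = (u_i + lam * sum_j w_ij d_j) / (1 + lam * sum_j w_ij).
            depth[i] = (u + lam * pull) / (1.0 + lam * w_sum)
    return depth


# Three superpixels in a chain; the third has an outlying unary depth.
unary = [10, 10, 20]
edges = [(0, 1), (1, 2)]
weights = [1.0, 1.0]
smoothed = refine_depths(unary, edges, weights, lam=1.0)
```

With `lam=0` the unary predictions are returned unchanged. Larger values pull adjacent superpixels toward a common depth, which corresponds to the role the abstract assigns to the pairwise term.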
