Nonparametric Feature Matching Based Conditional Random Fields for Gesture Recognition from Multi-Modal Video

We present a new gesture recognition method that is based on the conditional random field (CRF) model using multiple feature matching. Our approach solves the labeling problem, determining gesture categories and their temporal ranges at the same time. A generative probabilistic model is formalized and probability densities are nonparametrically estimated by matching input features with a training dataset. In addition to the conventional skeletal joint-based features, the appearance information near the active hand in an RGB image is exploited to capture the detailed motion of fingers. The estimated likelihood function is then used as the unary term for our CRF model. The smoothness term is also incorporated to enforce the temporal coherence of our solution. Frame-wise recognition results can then be obtained by applying an efficient dynamic programming technique. To estimate the parameters of the proposed CRF model, we incorporate the structured support vector machine (SSVM) framework that can perform efficient structured learning by using large-scale datasets. Experimental results demonstrate that our method provides effective gesture recognition results for challenging real gesture datasets. By scoring 0.8563 in the mean Jaccard index, our method has obtained the state-of-the-art results for the gesture recognition track of the 2014 ChaLearn Looking at People (LAP) Challenge.

[1]  Ying Wu,et al.  Learning Maximum Margin Temporal Warping for Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[2]  Benjamin Schrauwen,et al.  Sign Language Recognition Using Convolutional Neural Networks , 2014, ECCV Workshops.

[3]  Xiaodong Yang,et al.  EigenJoints-based action recognition using Naïve-Bayes-Nearest-Neighbor , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[4]  Eli Shechtman,et al.  In defense of Nearest-Neighbor based image classification , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Tido Röder,et al.  Efficient content-based retrieval of motion capture data , 2005, SIGGRAPH 2005.

[6]  Zicheng Liu,et al.  HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Qing Zhang,et al.  A Survey on Human Motion Analysis from Depth Data , 2013, Time-of-Flight and Depth Imaging.

[8]  Di Wu,et al.  Multi-modality Gesture Detection and Recognition with Un-supervision, Randomization and Discrimination , 2014, ECCV Workshops.

[9]  Andrew Blake,et al.  Efficient Human Pose Estimation from Single Depth Images , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Bart Selman,et al.  Unstructured human activity detection from RGBD images , 2011, 2012 IEEE International Conference on Robotics and Automation.

[11]  Tae-Seong Kim,et al.  Recognition of Human Home Activities via Depth Silhouettes and ℜ Transformation for Smart Homes , 2012 .

[12]  R. Battison,et al.  Lexical Borrowing in American Sign Language , 1978 .

[13]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[14]  Benjamin Klein,et al.  Discriminative Ferns Ensemble for Hand Pose Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Olga Veksler,et al.  Fast approximate energy minimization via graph cuts , 2001, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[17]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[18]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Meinard Müller,et al.  Efficient content-based retrieval of motion capture data , 2005, SIGGRAPH '05.

[20]  Lale Akarun,et al.  Gesture Recognition Using Template Based Random Forest Classifiers , 2014, ECCV Workshops.

[21]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[22]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[23]  Ling Shao,et al.  Deep Dynamic Neural Networks for Gesture Segmentation and Recognition , 2014, ECCV Workshops.

[24]  Limin Wang,et al.  Action and Gesture Temporal Spotting with Super Vector Representation , 2014, ECCV Workshops.

[25]  Fernando De la Torre,et al.  Generalized time warping for multi-modal alignment of human motion , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Andrea Vedaldi,et al.  Vlfeat: an open and portable library of computer vision algorithms , 2010, ACM Multimedia.

[27]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[28]  Christian Wolf,et al.  Multi-scale Deep Learning for Gesture Detection and Localization , 2014, ECCV Workshops.

[29]  Helena M. Mentis,et al.  Instructing people for training gestural interactive systems , 2012, CHI.

[30]  Radu Horaud,et al.  Continuous Gesture Recognition from Articulated Poses , 2014, ECCV Workshops.

[31]  Andrew Blake,et al.  "GrabCut" , 2004, ACM Trans. Graph..

[32]  Cristian Sminchisescu,et al.  Conditional models for contextual human motion recognition , 2006, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[33]  Svetlana Lazebnik,et al.  Superparsing , 2010, International Journal of Computer Vision.

[34]  Luc Van Gool,et al.  Gesture Recognition Portfolios for Personalization , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Cordelia Schmid,et al.  Towards Understanding Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[36]  Camille Monnier,et al.  A Multi-scale Boosted Detector for Efficient and Robust Gesture Recognition , 2014, ECCV Workshops.

[37]  Dimitris N. Metaxas,et al.  ASL recognition based on a coupling between HMMs and 3D motion analysis , 1998, Sixth International Conference on Computer Vision (IEEE Cat. No.98CH36271).

[38]  Fernando De la Torre,et al.  Joint segmentation and classification of human actions in video , 2011, CVPR 2011.

[39]  T D Albright,et al.  Visual motion perception. , 1995, Proceedings of the National Academy of Sciences of the United States of America.

[40]  Sergio Escalera,et al.  ChaLearn Looking at People Challenge 2014: Dataset and Results , 2014, ECCV Workshops.

[41]  David G. Lowe,et al.  Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration , 2009, VISAPP.

[42]  J. E. Kelley,et al.  The Cutting-Plane Method for Solving Convex Programs , 1960 .

[43]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[44]  Antonio Torralba,et al.  Nonparametric Scene Parsing via Label Transfer , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Larry S. Davis,et al.  Recognizing actions by shape-motion prototype trees , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[46]  Nanning Zheng,et al.  Concurrent Action Detection with Structural Prediction , 2013, 2013 IEEE International Conference on Computer Vision.

[47]  Trevor Darrell,et al.  Hidden Conditional Random Fields for Gesture Recognition , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[48]  Andrew Zisserman,et al.  Domain-Adaptive Discriminative One-Shot Learning of Gestures , 2014, ECCV.

[49]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[50]  Thomas Hofmann,et al.  Support vector machine learning for interdependent and structured output spaces , 2004, ICML.

[51]  Jake K. Aggarwal,et al.  View invariant human action recognition using histograms of 3D joints , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[52]  Ju Yong Chang Nonparametric Gesture Labeling from Multi-modal Data , 2014, ECCV Workshops.

[53]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[54]  Alexei A. Efros,et al.  Recovering Surface Layout from an Image , 2007, International Journal of Computer Vision.

[55]  Bharti Bansal,et al.  Gesture Recognition: A Survey , 2016 .

[56]  Cristian Sminchisescu,et al.  The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection , 2013, 2013 IEEE International Conference on Computer Vision.