Leveraging Hierarchical Parametric Networks for Skeletal Joints Based Action Segmentation and Recognition

Over the last few years, with the immense popularity of the Kinect, there has been renewed interest in developing methods for human gesture and action recognition from 3D skeletal data. A number of approaches have been proposed to extract representative features from 3D skeletal data, most commonly hard wired geometric or bio-inspired shape context features. We propose a hierarchial dynamic framework that first extracts high level skeletal joints features and then uses the learned representation for estimating emission probability to infer action sequences. Currently gaussian mixture models are the dominant technique for modeling the emission distribution of hidden Markov models. We show that better action recognition using skeletal features can be achieved by replacing gaussian mixture models by deep neural networks that contain many layers of features to predict probability distributions over states of hidden Markov models. The framework can be easily extended to include a ergodic state to segment and recognize actions simultaneously.

[1]  Geoffrey E. Hinton,et al.  A View of the Em Algorithm that Justifies Incremental, Sparse, and other Variants , 1998, Learning in Graphical Models.

[2]  Meinard Müller,et al.  Motion templates for automatic classification and retrieval of motion capture data , 2006, SCA '06.

[3]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[4]  Geoffrey E. Hinton,et al.  Modeling Human Motion Using Binary Latent Variables , 2006, NIPS.

[5]  Ramakant Nevatia,et al.  Recognition and Segmentation of 3-D Human Action Using HMM and Multi-class AdaBoost , 2006, ECCV.

[6]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[7]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[8]  Eli Shechtman,et al.  In defense of Nearest-Neighbor based image classification , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Alex Smola,et al.  Kernel methods in machine learning , 2007, math/0701907.

[10]  Hans-Peter Seidel,et al.  Efficient and Robust Annotation of Motion Capture Data , 2009 .

[11]  Thierry Artières,et al.  Neural conditional random fields , 2010, AISTATS.

[12]  Alexei A. Efros,et al.  Unbiased look at dataset bias , 2011, CVPR 2011.

[13]  Ilya Sutskever,et al.  Learning Recurrent Neural Networks with Hessian-Free Optimization , 2011, ICML.

[14]  Toby Sharp,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR.

[15]  Xiaodong Yang,et al.  EigenJoints-based action recognition using Naïve-Bayes-Nearest-Neighbor , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[16]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[17]  Kevin P. Murphy,et al.  Machine learning - a probabilistic perspective , 2012, Adaptive computation and machine learning series.

[18]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Sebastian Nowozin,et al.  Action Points: A Representation for Low-latency Online Human Action Recognition , 2012 .

[20]  Luc Van Gool,et al.  Coupled Action Recognition and Pose Estimation from Multiple Views , 2012, International Journal of Computer Vision.

[21]  Isabelle Guyon,et al.  ChaLearn gesture challenge: Design and first results , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[22]  Joseph J. LaViola,et al.  Exploring the Trade-off Between Accuracy and Observational Latency in Action Recognition , 2013, International Journal of Computer Vision.

[23]  Geoffrey E. Hinton,et al.  Acoustic Modeling Using Deep Belief Networks , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[24]  Helena M. Mentis,et al.  Instructing people for training gestural interactive systems , 2012, CHI.

[25]  Ruzena Bajcsy,et al.  Sequence of the Most Informative Joints (SMIJ): A new representation for human skeletal action recognition , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[26]  Hanqing Lu,et al.  Fusing multi-modal features for gesture recognition , 2013, ICMI '13.

[27]  Rolf Dach,et al.  Technical Report 2012 , 2013 .

[28]  Quan Z. Sheng,et al.  Online human gesture recognition from motion data streams , 2013, ACM Multimedia.

[29]  Sergio Escalera,et al.  Multi-modal gesture recognition challenge 2013: dataset and results , 2013, ICMI '13.

[30]  Ruzena Bajcsy,et al.  Bio-inspired Dynamic 3D Discriminative Skeletal Features for Human Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[31]  Ling Shao,et al.  Enhanced Computer Vision With Microsoft Kinect Sensor: A Review , 2013, IEEE Transactions on Cybernetics.

[32]  Sebastian Nowozin,et al.  A Non-parametric Bayesian Network Prior of Human Pose , 2013, 2013 IEEE International Conference on Computer Vision.