Classifying and visualizing motion capture sequences using deep neural networks

Gesture recognition using motion capture data and depth sensors has recently drawn increasing attention in visual recognition. Currently, most systems classify datasets with only a few dozen distinct actions. Moreover, feature extraction from the data is often computationally complex. In this paper, we propose a novel system to recognize actions from skeleton data with simple but effective features using deep neural networks. Features are extracted for each frame based on the relative positions of joints (PO), temporal differences (TD), and normalized trajectories of motion (NT). Given these features, a hybrid multi-layer perceptron is trained which simultaneously classifies and reconstructs the input data. We use a deep autoencoder to visualize the learnt features. The experiments show that deep neural networks can capture more discriminative information than, for instance, principal component analysis. We test our system on a public database with 65 classes and more than 2,000 motion sequences, obtaining an accuracy above 95%, which is, to our knowledge, the state-of-the-art result for such a large dataset.
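The abstract does not fix exact definitions for the three feature types or for the hybrid objective, so the following Python sketch is an illustration under stated assumptions, not the paper's implementation. It assumes skeleton input of shape (T, J, 3) (T frames, J joints, 3-D coordinates); the choice of root joint, the temporal offset `window`, the normalization scheme, the network sizes, and the reconstruction weight `alpha` are all hypothetical.

```python
import numpy as np

def extract_frame_features(skeleton, root=0, window=5):
    """Sketch of the three per-frame feature types named in the
    abstract (PO, TD, NT). `skeleton` has shape (T, J, 3).
    The concrete definitions below are assumptions."""
    T, J, _ = skeleton.shape
    # PO: joint positions relative to an assumed root joint, per frame.
    po = skeleton - skeleton[:, root:root + 1, :]
    # TD: temporal differences against a frame `window` steps earlier
    # (zeros for the first `window` frames).
    td = np.zeros_like(skeleton)
    td[window:] = skeleton[window:] - skeleton[:-window]
    # NT: motion trajectories centered over the sequence and scaled by
    # the sequence's overall spatial extent, for scale invariance.
    extent = np.linalg.norm(skeleton.max(axis=(0, 1)) - skeleton.min(axis=(0, 1)))
    nt = (skeleton - skeleton.mean(axis=0, keepdims=True)) / max(extent, 1e-8)
    # Concatenate the three feature types into one vector per frame.
    return np.concatenate([po, td, nt], axis=1).reshape(T, -1)
```

A hybrid network that both classifies and reconstructs its input can be sketched as shared hidden layers with two heads, trained on a weighted sum of the two losses; the architecture and `alpha` here are illustrative, not the paper's reported configuration.

```python
import torch
import torch.nn as nn

class HybridMLP(nn.Module):
    """Shared encoder with a softmax classification head and a linear
    decoder head that reconstructs the input features."""
    def __init__(self, in_dim, hidden=500, n_classes=65):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, n_classes)
        self.decoder = nn.Linear(hidden, in_dim)

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h), self.decoder(h)

def hybrid_loss(logits, recon, x, y, alpha=0.1):
    # Joint objective: cross-entropy for classification plus a
    # reconstruction term weighted by the assumed hyper-parameter alpha.
    return (nn.functional.cross_entropy(logits, y)
            + alpha * nn.functional.mse_loss(recon, x))
```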
