Learning motion-difference features using Gaussian restricted Boltzmann machines for efficient human action recognition

Learning visual words from video frames is challenging because deciding which word to assign to each subset of frames is difficult: two similar frames may have different meanings when describing human actions, such as starting to run versus starting to walk. To associate richer information with vector quantization and generate visual words, several recent approaches use complex algorithms to extract or learn spatio-temporal features from 3-D volumes of video frames. In this paper, we propose an efficient method that uses Gaussian restricted Boltzmann machines (RBMs) to learn motion-difference features from actions in videos. The difference between two video frames is defined by a subtraction of one frame from another that preserves both positive and negative changes, creating a simple spatio-temporal saliency map for an action. By construction, this subtraction removes the common shapes and background that should be irrelevant to action learning and recognition, and it highlights movement patterns in space, making the actions easier to learn from such saliency maps with shallow feature-learning models such as RBMs. In the experiments reported in this paper, we used a Gaussian RBM to learn actions from saliency maps of different motion images. Despite its simplicity, the motion-difference method achieved very good performance on benchmark datasets, specifically the Weizmann dataset (98.81%) and the KTH dataset (88.89%). A comparative analysis with hand-crafted and learned features using similar classifiers indicates that motion-difference features are competitive and very efficient.
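The signed frame subtraction described above can be sketched as follows. This is a minimal illustration, assuming grayscale frames stored as NumPy arrays; the function name `motion_difference` and the two-channel (positive/negative) layout are our own assumptions for exposition, not details taken from the paper.

```python
import numpy as np

def motion_difference(frame_a, frame_b):
    """Signed difference of two grayscale frames.

    Returns a two-channel saliency map: one channel keeps positive
    changes (pixels that got brighter), the other keeps negative
    changes (pixels that got darker). Static background and shapes
    common to both frames cancel out by construction.
    """
    diff = frame_b.astype(np.float32) - frame_a.astype(np.float32)
    positive = np.maximum(diff, 0.0)   # regions the subject moved into
    negative = np.maximum(-diff, 0.0)  # regions the subject vacated
    return np.stack([positive, negative], axis=-1)

# Toy example: a bright 2x2 "subject" shifts one pixel to the right.
a = np.zeros((4, 4), dtype=np.uint8)
b = np.zeros((4, 4), dtype=np.uint8)
a[1:3, 0:2] = 255
b[1:3, 1:3] = 255
sal = motion_difference(a, b)
# The overlap column cancels; only the leading and trailing edges of
# the motion remain, one in each channel.
```

Maps like `sal` (flattened and normalized) are the kind of real-valued input a Gaussian RBM can model directly, which is what motivates the Gaussian visible units in the paper.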
