Robust human action recognition via long short-term memory

The long short-term memory (LSTM) neural network uses specialized gating mechanisms to store information for extended periods of time. It is thus potentially well suited for complex visual processing, where the current video frame must be considered in the context of past frames. Recent studies have indeed shown that LSTM can effectively recognize and classify human actions (e.g., running, hand waving) in video data; however, these results were achieved under somewhat restricted settings. In this work, we seek to demonstrate that LSTM's performance remains robust even as experimental conditions deteriorate. Specifically, we show that classification accuracy degrades gracefully when the LSTM network is faced with (a) less available training data, (b) tighter deadlines for decision making (i.e., shorter available input sequences), and (c) poorer video quality (resulting from noise, dropped frames, or reduced resolution). We also demonstrate the benefits of memory for video processing, particularly under high noise or high frame drop rates. Our study is thus an initial step towards demonstrating LSTM's potential for robust action recognition in real-world scenarios.
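
The degraded conditions listed in (b) and (c) can be made concrete with a small sketch. The abstract does not specify how the degradations were generated, so everything below is an assumption for illustration only: the function name `degrade_clip`, its parameters, the use of additive Gaussian noise, the removal (rather than blanking) of dropped frames, and downsampling by pixel striding are hypothetical choices, not the authors' procedure.

```python
import numpy as np

def degrade_clip(frames, rng, max_len=None, noise_std=0.0,
                 drop_rate=0.0, downscale=1):
    """Apply illustrative degradations to a grayscale clip.

    frames: array of shape (T, H, W) with values in [0, 1].
    All parameter names and defaults are hypothetical.
    """
    out = np.asarray(frames, dtype=np.float64)

    # (b) Tighter decision deadline: only the first max_len frames are seen.
    if max_len is not None:
        out = out[:max_len]

    # (c) Reduced resolution: keep every downscale-th pixel (assumed scheme).
    if downscale > 1:
        out = out[:, ::downscale, ::downscale]

    # (c) Noise: additive Gaussian noise, clipped back to [0, 1] (assumed model).
    if noise_std > 0:
        out = np.clip(out + rng.normal(0.0, noise_std, out.shape), 0.0, 1.0)

    # (c) Dropped frames: remove each frame independently with probability
    # drop_rate, always keeping the first frame so the clip is never empty.
    if drop_rate > 0 and out.shape[0] > 0:
        keep = rng.random(out.shape[0]) >= drop_rate
        keep[0] = True
        out = out[keep]

    return out

# Usage on a synthetic 20-frame clip of 8x8 pixels.
rng = np.random.default_rng(0)
clip = rng.random((20, 8, 8))
degraded = degrade_clip(clip, rng, max_len=10, noise_std=0.1,
                        drop_rate=0.5, downscale=2)
print(degraded.shape)
```

Sweeping one parameter at a time (for example `drop_rate` from 0 to 0.9) and re-evaluating a fixed classifier on the degraded clips is one way to trace out a degradation curve of the kind the abstract describes.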
