Action Recognition Using 3D CNN and LSTM for Video Analytics

With the rapid growth of digital technology, large amounts of video data are being generated, making video analytics a promising technology. Human activity recognition in videos is currently receiving increased attention, and activity recognition systems form a large field of research and development focused on advanced machine learning algorithms, innovations in hardware architecture, and decreasing the cost of monitoring while increasing safety (Guo and Lai in Pattern Recognit 47:3343–3361, 2014, [1]). Existing systems for action recognition use Convolutional Neural Networks (CNNs): a video is treated as a sequence of frames, and the frame-level CNN features are fed to a Long Short-Term Memory (LSTM) model for recognition. However, because this approach takes only frame-level CNN features as input to the LSTM, it may fail to capture the rich motion information in adjacent frames or across multiple clips. It is important to consider adjacent frames, which carry salient motion cues, instead of mapping an entire frame into a static representation. To mitigate this drawback, a new methodology is proposed: first, saliency-aware methods are applied to generate saliency-aware videos; then, an end-to-end pipeline is designed by integrating a 3D CNN with an LSTM, followed by a time-series pooling layer and a softmax layer to predict the activities in the video.
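The proposed pipeline (3D CNN over clips, LSTM over the clip sequence, temporal pooling, then softmax) can be sketched in PyTorch as below. This is a minimal illustration, not the authors' exact configuration: the layer sizes, kernel shapes, mean pooling, and the class name `Saliency3DCNNLSTM` are all assumptions made for the example.

```python
# Hypothetical sketch of the described pipeline. All layer sizes and the
# choice of mean pooling are illustrative assumptions, not the paper's spec.
import torch
import torch.nn as nn


class Saliency3DCNNLSTM(nn.Module):
    def __init__(self, num_classes=10, hidden_size=128):
        super().__init__()
        # 3D CNN extracts a spatio-temporal feature per short clip.
        # Input per clip: (channels, frames, height, width).
        self.cnn3d = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),      # halves frames, H, and W
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),          # -> (batch, 32, 1, 1, 1)
        )
        # LSTM models the sequence of clip-level features.
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden_size,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):
        # clips: (batch, num_clips, channels, frames, height, width)
        b, n = clips.shape[:2]
        feats = self.cnn3d(clips.flatten(0, 1))  # (b*n, 32, 1, 1, 1)
        feats = feats.view(b, n, 32)             # clip feature sequence
        out, _ = self.lstm(feats)                # (b, n, hidden_size)
        pooled = out.mean(dim=1)                 # time-series pooling
        return torch.softmax(self.fc(pooled), dim=1)


model = Saliency3DCNNLSTM(num_classes=10)
# 2 videos, each split into 4 clips of 8 RGB frames at 32x32.
video = torch.randn(2, 4, 3, 8, 32, 32)
probs = model(video)
print(probs.shape)  # torch.Size([2, 10]); rows sum to 1
```

In this sketch the saliency-aware preprocessing is assumed to have already been applied to the input clips; only the 3D CNN + LSTM + pooling + softmax stages are shown.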

References

[1] Fei-Fei Li et al., "Large-Scale Video Classification with Convolutional Neural Networks," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.

[2] Ming Yang et al., "3D Convolutional Neural Networks for Human Action Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.

[3] Deva Ramanan et al., "Attentional Pooling for Action Recognition," NIPS, 2017.

[4] Rob Fergus et al., "Visualizing and Understanding Convolutional Networks," ECCV, 2013.

[5] Majid Mirmehdi et al., "Detecting the Moment of Completion: Temporal Models for Localising Action Completion," arXiv, 2017.

[6] Larry S. Davis et al., "Temporal Context Network for Activity Localization in Videos," IEEE International Conference on Computer Vision (ICCV), 2017.

[7] Cordelia Schmid et al., "Actions in Context," CVPR, 2009.

[8] Xiaodong Liu et al., "A Speculative Approach to Spatial-Temporal Efficiency with Multi-Objective Optimization in a Heterogeneous Cloud Environment," Security and Communication Networks, 2016.

[9] Ling Shao et al., "Video Salient Object Detection via Fully Convolutional Networks," IEEE Transactions on Image Processing, 2017.

[10] Cordelia Schmid et al., "Evaluation of Local Spatio-temporal Features for Action Recognition," BMVC, 2009.

[11] Serge J. Belongie et al., "Behavior Recognition via Sparse Spatio-temporal Features," IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.

[12] Cordelia Schmid et al., "Human Action Localization with Sparse Spatial Supervision," 2017.

[13] Lorenzo Torresani et al., "Learning Spatiotemporal Features with 3D Convolutional Networks," IEEE International Conference on Computer Vision (ICCV), 2015.

[14] Heng Tao Shen et al., "Beyond Frame-level CNN: Saliency-Aware 3-D CNN With LSTM for Video Action Recognition," IEEE Signal Processing Letters, 2017.

[15] Matthew J. Hausknecht et al., "Beyond Short Snippets: Deep Networks for Video Classification," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[16] Mubarak Shah et al., "Unsupervised Action Discovery and Localization in Videos," IEEE International Conference on Computer Vision (ICCV), 2017.

[17] Guodong Guo et al., "A Survey on Still Image Based Human Action Recognition," Pattern Recognition, 2014.

[18] Fatih Murat Porikli et al., "Saliency-aware Geodesic Video Object Segmentation," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.