DB-LSTM: Densely-connected Bi-directional LSTM for human action recognition

Abstract Although deep learning has achieved promising progress in recent years, action recognition remains a challenging task due to cluttered backgrounds, diverse scenes, occlusions, viewpoint variations, and camera motion. In this paper, we propose a novel deep learning model to capture the spatial and temporal patterns of human actions in videos. A sample representation learner is proposed to extract video-level temporal features, combining sparse temporal sampling with long-range temporal learning to form an efficient and effective training strategy. To improve the effectiveness and robustness of long-range temporal modeling, we propose a Densely-connected Bi-directional LSTM (DB-LSTM) network that models visual and temporal associations in both the forward and backward directions. The bidirectional layers are stacked and integrated via dense skip-connections to strengthen the network's capacity for temporal pattern modeling. Appearance and motion modalities are combined by a fusion module to further improve performance. Experiments on two benchmark datasets, UCF101 and HMDB51, demonstrate that the proposed DB-LSTM achieves promising performance and outperforms state-of-the-art approaches for action recognition.
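The core architectural idea of the abstract — stacking bidirectional recurrent layers where each layer receives the concatenation of the input and all preceding layers' outputs — can be sketched as below. This is a minimal structural illustration, not the paper's implementation: a simple tanh recurrence stands in for the LSTM cell, all dimensions and the final temporal average-pooling are illustrative assumptions, and the helper names (`rnn_direction`, `densely_connected_bilstm`, etc.) are hypothetical.

```python
import numpy as np

def rnn_direction(x, W, U, b, reverse=False):
    """Run a simple tanh recurrence over time in one direction.

    A stand-in for one LSTM direction; x has shape (T, d_in)."""
    T = x.shape[0]
    d_hid = U.shape[0]
    h = np.zeros(d_hid)
    out = np.zeros((T, d_hid))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        h = np.tanh(x[t] @ W + h @ U + b)
        out[t] = h
    return out

def bidirectional_layer(x, d_hid, rng):
    """Forward and backward passes, concatenated per time step."""
    d_in = x.shape[1]
    def params():
        # Small random weights; an illustrative initialization only.
        return (rng.standard_normal((d_in, d_hid)) * 0.1,
                rng.standard_normal((d_hid, d_hid)) * 0.1,
                np.zeros(d_hid))
    fwd = rnn_direction(x, *params())
    bwd = rnn_direction(x, *params(), reverse=True)
    return np.concatenate([fwd, bwd], axis=1)  # (T, 2 * d_hid)

def densely_connected_bilstm(x, num_layers=3, d_hid=8, seed=0):
    """Dense skip-connections: each layer's input is the concatenation
    of the original input and ALL previous layers' outputs."""
    rng = np.random.default_rng(seed)
    feats = [x]
    for _ in range(num_layers):
        layer_in = np.concatenate(feats, axis=1)
        feats.append(bidirectional_layer(layer_in, d_hid, rng))
    # Video-level feature: average the last layer's outputs over time
    # (one plausible pooling choice, assumed here for illustration).
    return feats[-1].mean(axis=0)

rng = np.random.default_rng(0)
T, d_in = 5, 4                      # 5 sampled frames, 4-d frame features
video_feat = densely_connected_bilstm(rng.standard_normal((T, d_in)))
print(video_feat.shape)             # (16,) = 2 * d_hid
```

Because every layer sees all earlier feature maps, gradients and temporal evidence have short paths through the stack, which is the motivation the abstract gives for the dense connectivity.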
