DB-LSTM: Densely-connected Bi-directional LSTM for human action recognition

Abstract Although deep learning has achieved promising progress in recent years, action recognition remains a challenging task due to cluttered backgrounds, diverse scenes, occlusions, viewpoint variations, and camera motion. In this paper, we propose a novel deep learning model to capture the spatial and temporal patterns of human actions in videos. A sample representation learner is proposed to extract video-level temporal features, combining sparse temporal sampling with long-range temporal learning to form an efficient and effective training strategy. To improve the effectiveness and robustness of long-range temporal modeling, we propose a Densely-connected Bi-directional LSTM (DB-LSTM) network that models visual and temporal associations in both the forward and backward directions. The bidirectional layers are stacked and integrated via dense skip-connections to strengthen the network's capacity for temporal pattern modeling. Appearance and motion modalities are combined by a fusion module to further improve performance. Experiments on two benchmark datasets, UCF101 and HMDB51, demonstrate that the proposed DB-LSTM achieves promising performance and outperforms state-of-the-art approaches for action recognition.
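The core architectural idea of the abstract — stacking bidirectional recurrent layers where each layer receives the concatenation of the input and all preceding layers' outputs — can be sketched as below. This is a minimal structural illustration, not the paper's implementation: a simple tanh recurrence stands in for the LSTM cell, all dimensions and the final temporal average-pooling are illustrative assumptions, and the helper names (`rnn_direction`, `densely_connected_bilstm`, etc.) are hypothetical.

```python
import numpy as np

def rnn_direction(x, W, U, b, reverse=False):
    """Run a simple tanh recurrence over time in one direction.

    A stand-in for one LSTM direction; x has shape (T, d_in)."""
    T = x.shape[0]
    d_hid = U.shape[0]
    h = np.zeros(d_hid)
    out = np.zeros((T, d_hid))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        h = np.tanh(x[t] @ W + h @ U + b)
        out[t] = h
    return out

def bidirectional_layer(x, d_hid, rng):
    """Forward and backward passes, concatenated per time step."""
    d_in = x.shape[1]
    def params():
        # Small random weights; an illustrative initialization only.
        return (rng.standard_normal((d_in, d_hid)) * 0.1,
                rng.standard_normal((d_hid, d_hid)) * 0.1,
                np.zeros(d_hid))
    fwd = rnn_direction(x, *params())
    bwd = rnn_direction(x, *params(), reverse=True)
    return np.concatenate([fwd, bwd], axis=1)  # (T, 2 * d_hid)

def densely_connected_bilstm(x, num_layers=3, d_hid=8, seed=0):
    """Dense skip-connections: each layer's input is the concatenation
    of the original input and ALL previous layers' outputs."""
    rng = np.random.default_rng(seed)
    feats = [x]
    for _ in range(num_layers):
        layer_in = np.concatenate(feats, axis=1)
        feats.append(bidirectional_layer(layer_in, d_hid, rng))
    # Video-level feature: average the last layer's outputs over time
    # (one plausible pooling choice, assumed here for illustration).
    return feats[-1].mean(axis=0)

rng = np.random.default_rng(0)
T, d_in = 5, 4                      # 5 sampled frames, 4-d frame features
video_feat = densely_connected_bilstm(rng.standard_normal((T, d_in)))
print(video_feat.shape)             # (16,) = 2 * d_hid
```

Because every layer sees all earlier feature maps, gradients and temporal evidence have short paths through the stack, which is the motivation the abstract gives for the dense connectivity.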
