Correlational Convolutional LSTM for human action recognition

Abstract: In light of the recent exponential growth of video data, the need for automated video processing has increased substantially. Many representation-learning approaches have been proposed to capture the intrinsic structure of video data, but they focus on learning spatial features and temporal dependencies, while motion features are hand-crafted and left out of the learning process. In this work, we present C2LSTM, an extension of the LSTM unit that learns motion information alongside spatial features and temporal dependencies. The unit uses convolution and correlation operators to account for both the spatial and the motion structure of video data. We also design a deep network for human action recognition built from the proposed units and evaluate it on two well-known benchmarks, UCF101 and HMDB51. The results indicate that C2LSTM captures motion as well as spatial features and temporal dependencies.
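To make the role of a correlation operator concrete, the following is a minimal sketch of a local cross-correlation between the feature maps of two consecutive frames. It is an illustration only and not the C2LSTM formulation, which the abstract does not spell out. The function name `local_correlation`, the `max_disp` parameter, and the zero-padding choice are assumptions made for this example.

```python
import numpy as np

def local_correlation(f1, f2, max_disp=1):
    """Correlate two feature maps over a small displacement window.

    f1, f2: arrays of shape (C, H, W), e.g. features of consecutive frames.
    Returns an array of shape ((2*max_disp+1)**2, H, W). Each output channel
    holds the channel-averaged product of f1 at (y, x) and f2 at (y+dy, x+dx)
    for one displacement (dy, dx).
    """
    c, h, w = f1.shape
    # Zero-pad f2 spatially so displaced windows stay in bounds (an assumption).
    padded = np.pad(f2, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    out = []
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            # shifted[:, y, x] equals f2[:, y + dy, x + dx] (zero outside f2).
            shifted = padded[:, max_disp + dy:max_disp + dy + h,
                             max_disp + dx:max_disp + dx + w]
            out.append((f1 * shifted).mean(axis=0))
    return np.stack(out)

# Toy input: a single active pixel that moves one step to the right.
f1 = np.zeros((1, 5, 5)); f1[0, 2, 2] = 1.0
f2 = np.zeros((1, 5, 5)); f2[0, 2, 3] = 1.0
corr = local_correlation(f1, f2)
best = int(corr[:, 2, 2].argmax())   # index of the strongest displacement
dy, dx = best // 3 - 1, best % 3 - 1
```

In this toy input the strongest response at the original pixel location falls on the displacement `(dy, dx) = (0, 1)`, which is the kind of motion cue that a convolution over a single frame cannot express by itself.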
