Action Recognition Based on Two-Stream Convolutional Networks With Long-Short-Term Spatiotemporal Features

Human action recognition is an important research topic in computer vision because of its practical applications. Recently, a variety of approaches based on deep learning features have been proposed, owing to the effectiveness of deep neural networks. However, most of these approaches cannot fully extract spatiotemporal features from videos because they do not account for the diversity of scales in the temporal domain. In this paper, we propose a two-stream convolutional network with long-short-term spatiotemporal features (LSF CNN) for the human action recognition task. The network is composed mainly of two subnetworks. One is a long-term spatiotemporal feature extraction network (LT-Net) that takes stacked RGB images as input. The other is a short-term spatiotemporal feature extraction network (ST-Net) that takes as input the optical flow estimated from two adjacent frames. The features at the two temporal scales are fused in the fully-connected layer and fed into a linear support vector machine (SVM). We also propose a new expression for the optical flow field, which is shown to perform better than the traditional expression on the action recognition problem. With the two-stream architecture, the network can fully learn deep features in both the spatial and temporal domains. Experimental results on the HMDB51 and UCF101 datasets indicate that the proposed approach improves action recognition accuracy by using long-short-term spatiotemporal information.
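The fusion-then-classify step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the two feature vectors are combined or how the linear SVM handles multiple classes, so the concatenation rule, the one-vs-rest scoring, the function names, and all numeric values below are assumptions made for illustration only; they are not the authors' implementation.

```python
from typing import List, Sequence


def fuse_features(long_term: Sequence[float], short_term: Sequence[float]) -> List[float]:
    """Fuse the two stream outputs by concatenation (assumed fusion rule)."""
    return list(long_term) + list(short_term)


def linear_svm_predict(
    features: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> int:
    """Score each class with w.x + b and return the index of the best class.

    This is a one-vs-rest linear decision rule (assumed multi-class strategy).
    """
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, features)) + b
        for w, b in zip(weights, biases)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical toy features standing in for the LT-Net and ST-Net outputs.
lt_feat = [0.2, 0.9]        # long-term stream (stacked RGB images)
st_feat = [0.7, 0.1, 0.4]   # short-term stream (optical flow)

fused = fuse_features(lt_feat, st_feat)

# Hypothetical pre-trained SVM parameters for three action classes.
weights = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 1.0],
]
biases = [0.0, 0.0, -0.5]

predicted = linear_svm_predict(fused, weights, biases)
print(predicted)
```

In practice the feature vectors would come from the fully-connected layers of the two subnetworks and the SVM parameters would be learned from training data; the toy values here only make the data flow concrete.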
