Paper information - Learning motion representation for real-time spatio-temporal action localization

Abstract: Current deep learning based spatio-temporal action localization methods that use motion information (predominantly optical flow) achieve state-of-the-art performance. However, because the optical flow is pre-computed, these methods face two problems: computational efficiency is low, and the whole network is not end-to-end trainable. We propose a novel spatio-temporal action localization approach with an integrated optical flow sub-network to address both issues. Specifically, our flow subnet estimates optical flow efficiently and accurately from multiple consecutive RGB frames rather than two adjacent frames in a deep network. At the same time, action localization is implemented in the same network and interacts with flow computation end-to-end. To increase speed, we exploit a neural network based feature fusion method organized as a pyramid hierarchy. It fuses spatial and temporal features at different granularities via a combination function (i.e., concatenation) and point-wise convolution to obtain multiscale spatio-temporal action features. Experimental results on three publicly available datasets, UCF101-24, JHMDB and AVA, show that with both RGB appearance and optical flow cues, the proposed method achieves state-of-the-art performance in both efficiency and accuracy. Notably, it improves efficiency significantly: compared to the most efficient existing method, it is 1.9 times faster in running speed and 1.3% video-mAP more accurate on UCF101-24. Our proposed method reaches real-time computation for the first time (up to 38 FPS).
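The fusion step described in the abstract (concatenation followed by point-wise convolution, applied at each pyramid level) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the channel counts, the three-level pyramid, and the random weights are all assumptions chosen for illustration, and a trained network would learn the weights.

```python
import numpy as np


def pointwise_conv(x, weight, bias):
    """1x1 (point-wise) convolution: mixes channels independently at each pixel.

    x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    """
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


def fuse_level(spatial, temporal, weight, bias):
    """Fuse one pyramid level: concatenate along channels, then 1x1 convolution."""
    stacked = np.concatenate([spatial, temporal], axis=0)  # (C_s + C_t, H, W)
    return pointwise_conv(stacked, weight, bias)


def fuse_pyramid(spatial_feats, temporal_feats, params):
    """Apply the fusion independently at every scale of the pyramid."""
    return [
        fuse_level(s, t, w, b)
        for s, t, (w, b) in zip(spatial_feats, temporal_feats, params)
    ]


# Hypothetical three-level pyramid: (channels, height, width) per level.
rng = np.random.default_rng(0)
shapes = [(8, 16, 16), (16, 8, 8), (32, 4, 4)]
spatial_feats = [rng.standard_normal(s) for s in shapes]
temporal_feats = [rng.standard_normal(s) for s in shapes]

# Random placeholder weights mapping 2*C concatenated channels back to C.
params = [(rng.standard_normal((c, 2 * c)) * 0.1, np.zeros(c)) for c, _, _ in shapes]

fused = fuse_pyramid(spatial_feats, temporal_feats, params)
for f in fused:
    print(f.shape)
```

Because a point-wise convolution only mixes channels, its cost grows linearly with the number of pixels at each level, which is one plausible reason such a fusion is cheap compared with larger kernels.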
