Low-Latency Human Action Recognition with Weighted Multi-Region Convolutional Neural Network

Spatio-temporal contexts are crucial for understanding human actions in videos. Recent state-of-the-art Convolutional Neural Network (ConvNet) based action recognition systems frequently involve 3D spatio-temporal ConvNet filters, the chunking of videos into fixed-length clips, and Long Short Term Memory (LSTM) networks. Such architectures are designed to take advantage of both short-term and long-term temporal contexts, but they also require the accumulation of a predefined number of video frames (e.g., to construct video clips for 3D ConvNet filters, or to generate enough inputs for LSTMs). This paper proposes a new action recognition system for applications that require low-latency online predictions of fast-changing action scenes. Termed "Weighted Multi-Region Convolutional Neural Network" (WMR ConvNet), the proposed system is LSTM-free and is based on a 2D ConvNet that does not require the accumulation of video frames for 3D ConvNet filtering. Unlike early 2D ConvNets that are based purely on RGB frames and optical flow frames, the WMR ConvNet is designed to simultaneously capture multiple spatial and short-term temporal cues (e.g., human poses, occurrences of objects in the background) with both the primary region (foreground) and secondary regions (mostly background). On both the UCF101 and HMDB51 datasets, the proposed WMR ConvNet achieves state-of-the-art performance among competing low-latency algorithms. Furthermore, WMR ConvNet even outperforms the 3D ConvNet based C3D algorithm, which requires video frame accumulation. In an ablation study with the optical flow ConvNet stream removed, the ablated WMR ConvNet still outperforms competing algorithms.
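The abstract does not spell out how the primary and secondary regions are combined, so the following is only an illustrative sketch of one plausible reading of "weighted multi-region": per-class scores from the primary region are blended with the averaged scores of the secondary regions, frame by frame, with no clip accumulation. The function names, the averaging of secondary regions, and the `primary_weight` parameter are all assumptions made for illustration, not the authors' method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_region_scores(primary, secondaries, primary_weight=0.6):
    """Blend primary-region scores with averaged secondary-region scores.

    Hypothetical fusion rule (an assumption, not taken from the paper):
    fused = w * primary + (1 - w) * mean(secondaries), then softmax.
    """
    if not 0.0 <= primary_weight <= 1.0:
        raise ValueError("primary_weight must lie in [0, 1]")
    num_classes = len(primary)
    if secondaries:
        # Average each class score across all secondary (background) regions.
        secondary_mean = [
            sum(region[c] for region in secondaries) / len(secondaries)
            for c in range(num_classes)
        ]
        fused = [
            primary_weight * p + (1.0 - primary_weight) * s
            for p, s in zip(primary, secondary_mean)
        ]
    else:
        # No secondary regions available: fall back to the primary region.
        fused = list(primary)
    return softmax(fused)


def predict_frame(primary, secondaries, primary_weight=0.6):
    """Return the index of the most probable class for a single frame."""
    probs = fuse_region_scores(primary, secondaries, primary_weight)
    return max(range(len(probs)), key=probs.__getitem__)
```

Because each call consumes the region scores of a single frame, a prediction is available as soon as that frame has been processed. In this sketch, setting `primary_weight=1.0` ignores the background regions entirely, which makes it easy to see how much the secondary cues change a decision.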
