Dimensionality Reduction for Action Proposals: To Extract or to Select?

Action detection is an important task in computer vision. Recent deep-learning-based approaches achieve impressive accuracy, but they still suffer from low speed. To address this, researchers have introduced many methods, among which temporal action proposal generation is one of the most effective. Fed with features extracted from videos, these methods propose temporal segments likely to contain actions, reducing the computational workload of subsequent detection. A common approach is to use 3D convolutional networks (C3D) to extract spatio-temporal features from videos; however, the dimensionality of these features is generally high, resulting in a sparse distribution along each dimension. It is therefore worthwhile to apply dimensionality reduction during temporal proposal generation. In this work, we experimentally find that reducing the feature dimensionality matters for temporal action proposals: it not only accelerates the subsequent proposal stage but also improves its performance. Experimental results on the THUMOS 2014 dataset demonstrate that feature extraction (projection-based) dimensionality reduction is more suitable for temporal action proposals than feature selection.
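The distinction between the two families of methods can be sketched in code. Feature extraction builds new dimensions as combinations of the originals (e.g., PCA), while feature selection keeps a subset of the original dimensions (e.g., by variance). The sketch below is illustrative only: the feature matrix is random, the 4096-d size mimics a C3D fc6 layer, and the function names are ours, not from the paper.

```python
import numpy as np

# Hypothetical setup: 500 video clips, each with a 4096-d C3D-style feature.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4096))

def extract_pca(X, k):
    """Feature extraction: project data onto the top-k principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T  # new features are linear combinations of all dims

def select_by_variance(X, k):
    """Feature selection: keep the k original dimensions with highest variance."""
    idx = np.argsort(X.var(axis=0))[::-1][:k]
    return X[:, idx]  # new features are a subset of the original dims

Z_extract = extract_pca(X, 256)        # (500, 256) projected features
Z_select = select_by_variance(X, 256)  # (500, 256) selected features
```

Both produce 256-d features, but the extracted features mix information from all 4096 original dimensions, whereas the selected features discard everything outside the chosen subset, which is one intuition for why extraction can fare better on sparse high-dimensional features.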
