Actionness Estimation Using Hybrid Fully Convolutional Networks

Actionness was introduced to quantify the likelihood of containing a generic action instance at a specific location. Accurate and efficient estimation of actionness is important in video analysis and may benefit other relevant tasks such as action recognition and action detection. This paper presents a new deep architecture for actionness estimation, called hybrid fully convolutional network (HFCN), which is composed of appearance FCN (A-FCN) and motion FCN (M-FCN). These two FCNs leverage the strong capacity of deep models to estimate actionness maps from the perspectives of static appearance and dynamic motion, respectively. In addition, the fully convolutional nature of H-FCN allows it to efficiently process videos with arbitrary sizes. Experiments are conducted on the challenging datasets of Stanford40, UCF Sports, and JHMDB to verify the effectiveness of H-FCN on actionness estimation, which demonstrate that our method achieves superior performance to previous ones. Moreover, we apply the estimated actionness maps on action proposal generation and action detection. Our actionness maps advance the current state-of-the-art performance of these tasks substantially.

[1]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[2]  Yang Wang,et al.  Discriminative figure-centric models for joint action localization and recognition , 2011, 2011 International Conference on Computer Vision.

[3]  Bowen Zhang,et al.  Real-Time Action Recognition with Enhanced Motion Vector CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Cordelia Schmid,et al.  Towards Understanding Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[6]  Rob Fergus,et al.  Visualizing and Understanding Convolutional Networks , 2013, ECCV.

[7]  Limin Wang,et al.  Video Action Detection with Relational Dynamic-Poselets , 2014, ECCV.

[8]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Xiaogang Wang,et al.  Slicing Convolutional Neural Network for Crowd Video Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[11]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Ying Wu,et al.  Discriminative subvolume search for efficient action detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Ran Xu,et al.  Human action segmentation with hierarchical supervoxel consistency , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Gang Yu,et al.  Fast action proposals for human action detection and search , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  G. G. Stokes "J." , 1890, The New Yale Book of Quotations.

[16]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[17]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[18]  Limin Wang,et al.  Motionlets: Mid-level 3D Parts for Human Motion Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[19]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Dahua Lin,et al.  Recognize complex events from static images by fusing deep channels , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Limin Wang,et al.  Latent Hierarchical Model of Temporal Structure for Complex Activity Classification , 2014, IEEE Transactions on Image Processing.

[22]  Cordelia Schmid,et al.  Spatio-temporal Object Detection Proposals , 2014, ECCV.

[23]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[24]  Chenliang Xu,et al.  Evaluation of super-voxel methods for early video processing , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[26]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[27]  Limin Wang,et al.  MoFAP: A Multi-level Representation for Action Recognition , 2015, International Journal of Computer Vision.

[28]  Wei Chen,et al.  Actionness Ranking with Lattice Conditional Ordinal Random Fields , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Martial Hebert,et al.  Event Detection in Crowded Videos , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[30]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[31]  Yi Yang,et al.  DevNet: A Deep Event Network for multimedia event detection and evidence recounting , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Takeo Kanade,et al.  Background Subtraction for Freely Moving Cameras , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[33]  Patrick Bouthemy,et al.  Action Localization with Tubelets from Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Yi Yang,et al.  Exploring Semantic Inter-Class Relationships (SIR) for Zero-Shot Action Recognition , 2015, AAAI.

[35]  Mubarak Shah,et al.  Spatiotemporal Deformable Part Models for Action Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Jitendra Malik,et al.  Finding action tubes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Leonidas J. Guibas,et al.  Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[38]  Deli Zhao,et al.  Recognizing an Action Using Its Name: A Knowledge-Based Approach , 2016, International Journal of Computer Vision.

[39]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[40]  Zhe Wang,et al.  Towards Good Practices for Very Deep Two-Stream ConvNets , 2015, ArXiv.

[41]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  Santiago Manen,et al.  Online Video SEEDS for Temporal Window Objectness , 2013, 2013 IEEE International Conference on Computer Vision.

[43]  Bernt Schiele,et al.  How good are detection proposals, really? , 2014, BMVC.

[44]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[45]  Thomas Deselaers,et al.  Measuring the Objectness of Image Windows , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[47]  Xiaogang Wang,et al.  Deeply learned attributes for crowded scene understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[50]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[51]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.