Fast action proposals for human action detection and search

In this paper we target at generating generic action proposals in unconstrained videos. Each action proposal corresponds to a temporal series of spatial bounding boxes, i.e., a spatio-temporal video tube, which has a good potential to locate one human action. Assuming each action is performed by a human with meaningful motion, both appearance and motion cues are utilized to measure the actionness of the video tubes. After picking those spatiotemporal paths of high actionness scores, our action proposal generation is formulated as a maximum set coverage problem, where greedy search is performed to select a set of action proposals that can maximize the overall actionness score. Compared with existing action proposal approaches, our action proposals do not rely on video segmentation and can be generated in nearly real-time. Experimental results on two challenging datasets, MSRII and UCF 101, validate the superior performance of our action proposals as well as competitive results on action detection and search.

[1]  Limin Wang,et al.  Video Action Detection with Relational Dynamic-Poselets , 2014, ECCV.

[2]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[3]  Cordelia Schmid,et al.  Spatio-temporal Object Detection Proposals , 2014, ECCV.

[4]  Fernando De la Torre,et al.  Unsupervised Temporal Commonality Discovery , 2012, ECCV.

[5]  Thomas Deselaers,et al.  What is an object? , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[6]  Jonathon Shlens,et al.  Fast, Accurate Detection of 100,000 Object Classes on a Single Machine , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[7]  Zicheng Liu,et al.  Cross-dataset action detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[8]  P. Siva,et al.  Action Detection in Crowd , 2010, BMVC.

[9]  Chenliang Xu,et al.  Evaluation of super-voxel methods for early video processing , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Martial Hebert,et al.  Event Detection in Crowded Videos , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[11]  Philip H. S. Torr,et al.  BING: Binarized normed gradients for objectness estimation at 300fps , 2014, Computational Visual Media.

[12]  Gang Yu,et al.  Unsupervised random forest indexing for fast action search , 2011, CVPR 2011.

[13]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  D. Forsyth,et al.  Video Event Detection: From Subvolume Localization To Spatio-Temporal Path Search. , 2013, IEEE transactions on pattern analysis and machine intelligence.

[15]  Cristian Sminchisescu,et al.  Video Object Segmentation by Salient Segment Chain Composition , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[16]  Gang Yu,et al.  Real-time human action search using random forest based hough voting , 2011, ACM Multimedia.

[17]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[18]  Greg Mori,et al.  Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization , 2013, NIPS.

[19]  Mubarak Shah,et al.  Recognizing human actions , 2005, VSSN@MM.

[20]  Mohamed R. Amer,et al.  Multiobject tracking as maximum weight independent set , 2011, CVPR 2011.

[21]  Charless C. Fowlkes,et al.  Globally-optimal greedy algorithms for tracking a variable number of objects , 2011, CVPR 2011.

[22]  Ying Wu,et al.  Discriminative Video Pattern Search for Efficient Action Detection , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Mubarak Shah,et al.  Spatiotemporal Deformable Part Models for Action Detection , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Luc Van Gool,et al.  Pedestrian detection at 100 frames per second , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Thomas Mensink,et al.  Improving the Fisher Kernel for Large-Scale Image Classification , 2010, ECCV.

[26]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[27]  Wei Chen,et al.  Actionness Ranking with Lattice Conditional Ordinal Random Fields , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[29]  Junsong Yuan,et al.  Max-Margin Structured Output Regression for Spatio-Temporal Action Localization , 2012, NIPS.

[30]  Luc Van Gool,et al.  Hough Forests for Object Detection, Tracking, and Action Recognition , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[32]  Gang Yu,et al.  Propagative Hough Voting for Human Activity Recognition , 2012, ECCV.

[33]  Tao Xiang,et al.  Weakly Supervised Action Detection , 2011, BMVC.

[34]  David A. Forsyth,et al.  Video Event Detection: From Subvolume Localization to Spatiotemporal Path Search , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Xilin Chen,et al.  A unified framework for locating and recognizing human actions , 2011, CVPR 2011.

[36]  Nazli Ikizler-Cinbis,et al.  Object, Scene and Actions: Combining Multiple Features for Human Action Recognition , 2010, ECCV.

[37]  M. L. Fisher,et al.  An analysis of approximations for maximizing submodular set functions—I , 1978, Math. Program..

[38]  Santiago Manen,et al.  Online Video SEEDS for Temporal Window Objectness , 2013, 2013 IEEE International Conference on Computer Vision.

[39]  Bernt Schiele,et al.  How good are detection proposals, really? , 2014, BMVC.

[40]  Thomas Brox,et al.  A Unified Video Segmentation Benchmark: Annotation, Metrics and Analysis , 2013, 2013 IEEE International Conference on Computer Vision.

[41]  Pietro Perona,et al.  Fast Feature Pyramids for Object Detection , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  Nazli Ikizler-Cinbis,et al.  Action Recognition and Localization by Hierarchical Space-Time Segments , 2013, 2013 IEEE International Conference on Computer Vision.

[43]  Jitendra Malik,et al.  Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[44]  Yang Wang,et al.  Discriminative figure-centric models for joint action localization and recognition , 2011, 2011 International Conference on Computer Vision.

[45]  Patrick Bouthemy,et al.  Action Localization with Tubelets from Motion , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.