Fast cascaded action localization in video using frame alignment

Locating human actions in videos is challenging because of the complexity and variability of human motion and the amount of video data to be searched. We propose a method that detects and locates a set of actions in a video database by taking into account their temporal structure at the frame level. While other methods aggregate frames into action parts, we leverage the complementarity between aggregation and frame-level comparison of sequences. By combining these two techniques in a cascade, we aim to address large-scale retrieval. Evaluation on popular datasets shows state-of-the-art results, along with efficient detection and low storage requirements.
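The cascade idea can be illustrated with a minimal sketch. The abstract does not specify the descriptors, the aggregation, or the alignment procedure, so everything here is an assumption made for illustration: mean pooling stands in for the aggregation stage, classic dynamic time warping stands in for the frame-level comparison, and the names `mean_pool`, `dtw_cost`, `cascade_search`, and the `keep` parameter are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_pool(frames):
    """Coarse stage (assumed): average per-frame descriptors into one vector."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def dtw_cost(query, candidate):
    """Fine stage (assumed): frame-level alignment cost via dynamic time warping."""
    n, m = len(query), len(candidate)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(query[i - 1], candidate[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def cascade_search(query, windows, keep=2):
    """Rank windows cheaply on pooled descriptors, then re-rank the
    `keep` best candidates with the more expensive frame alignment."""
    q_pooled = mean_pool(query)
    coarse = sorted(
        range(len(windows)),
        key=lambda i: euclidean(q_pooled, mean_pool(windows[i])),
    )
    shortlist = coarse[:keep]
    return sorted(shortlist, key=lambda i: dtw_cost(query, windows[i]))

# Toy 1-D "descriptors": windows 0 and 1 have the same mean as the query,
# but only window 1 preserves the temporal order of its frames.
query = [[0.0], [1.0], [2.0]]
windows = [
    [[2.0], [1.0], [0.0]],         # reversed order
    [[0.0], [1.0], [1.0], [2.0]],  # same order, slightly slower
    [[5.0], [5.0], [5.0]],         # unrelated content
]
print(cascade_search(query, windows, keep=2))  # → [1, 0]
```

In this toy example the pooled stage cannot separate windows 0 and 1, since pooling discards temporal order, but it cheaply rejects window 2; the alignment stage then prefers window 1. That is the sense in which the two comparisons are complementary, though the method in the paper may realize each stage quite differently.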
