Action Detection with Actom Sequence Models

We address the problem of detecting actions, such as drinking or opening a door, in hours of challenging video data. We propose a model based on a sequence of atomic action units, termed "actoms", that are semantically meaningful and characteristic for the action. Our Actom Sequence Model (ASM) represents the temporal structure of actions as a sequence of histograms of actom-anchored visual features. Our representation, which can be seen as a temporally structured extension of the bag-of-features, is flexible, sparse, and discriminative. Training requires the annotation of actoms for action examples. At test time, actoms are detected automatically based on a non-parametric model of the distribution of actoms, which also acts as a prior on an action's temporal structure. We present experimental results on two recent benchmarks for temporal action detection: "Coffee and Cigarettes" and the "DLSB" dataset. We also adapt our approach to a classification by detection set-up and demonstrate its applicability on the challenging "Hollywood 2" dataset. We show that our ASM method outperforms the current state of the art in temporal action detection, as well as baselines that detect actions with a sliding window method combined with bag-of-features.

[1]  Aaron F. Bobick,et al.  Learning visual behavior for gesture analysis , 1995, Proceedings of International Symposium on Computer Vision - ISCV.

[2]  Dong Han,et al.  Selection and context for action recognition , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[3]  Luc Van Gool,et al.  A Hough transform-based voting framework for action recognition , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[4]  Patrick Pérez,et al.  Retrieving actions in movies , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[5]  Luc Van Gool,et al.  Action snippets: How many frames does human action recognition require? , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Martial Hebert,et al.  Modeling the Temporal Extent of Actions , 2010, ECCV.

[7]  Cordelia Schmid,et al.  Human Focused Action Localization in Video , 2010, ECCV Workshops.

[8]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[9]  Lihi Zelnik-Manor,et al.  Event-based analysis of video , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[10]  Matthew Brand,et al.  Discovery and Segmentation of Activities in Video , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[11]  M. Rosenblatt Remarks on Some Nonparametric Estimates of a Density Function , 1956 .

[12]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Ronald Poppe,et al.  A survey on vision-based human action recognition , 2010, Image Vis. Comput..

[14]  Alex Pentland,et al.  Space-time gestures , 1993, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Jean Ponce,et al.  Automatic annotation of human actions in video , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[16]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, ICPR 2004.

[17]  Cordelia Schmid,et al.  Mining Visual Actions from Movies , 2009, BMVC.

[18]  Jiebo Luo,et al.  Recognizing realistic actions from videos , 2009, CVPR.

[19]  Qiang Ji,et al.  Knowledge Based Activity Recognition with Dynamic Bayesian Network , 2010, ECCV.

[20]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[21]  Juan Carlos Niebles,et al.  Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification , 2010, ECCV.

[22]  Jitendra Malik,et al.  Recognizing action at a distance , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[23]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2008, International Journal of Computer Vision.

[24]  Radu Horaud,et al.  An Unsupervised Framework for Action Recognition Using Actemes , 2010, ACCV.

[25]  Tae-Kyun Kim,et al.  Canonical Correlation Analysis of Video Volume Tensors for Action Categorization and Detection , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Alex Pentland,et al.  A Bayesian Computer Vision System for Modeling Human Interactions , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[29]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  James L. Crowley,et al.  Probabilistic recognition of activity using local appearance , 1999, Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149).

[31]  Rémi Ronfard,et al.  A survey of vision-based methods for action representation, segmentation and recognition , 2011, Comput. Vis. Image Underst..

[32]  Eli Shechtman,et al.  Space-time behavior based correlation , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[33]  Andrew Gilbert,et al.  Action Recognition Using Mined Hierarchical Compound Features , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Aaron F. Bobick,et al.  Parametric Hidden Markov Models for Gesture Recognition , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[35]  Li Wang,et al.  Human Action Segmentation and Recognition Using Discriminative Semi-Markov Models , 2011, International Journal of Computer Vision.

[36]  William Brendel,et al.  Activities as Time Series of Human Postures , 2010, ECCV.

[37]  Nazli Ikizler-Cinbis,et al.  Learning actions from the Web , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[38]  Alex Pentland,et al.  Coupled hidden Markov models for complex action recognition , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[39]  Ivan Laptev,et al.  On Space-Time Interest Points , 2005, International Journal of Computer Vision.

[40]  David J. Kriegman,et al.  Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Ramakant Nevatia,et al.  Single View Human Action Recognition using Key Pose Matching and Viterbi Path Searching , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[42]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[43]  Junji Yamato,et al.  Recognizing human action in time-sequential images using hidden Markov model , 1992, Proceedings 1992 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[44]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[45]  Alex Pentland,et al.  Real-time American Sign Language recognition from video using hidden Markov models , 1995 .

[46]  Jake K. Aggarwal,et al.  Modeling human activities as speech , 2011, CVPR 2011.

[47]  Ronen Basri,et al.  Actions as space-time shapes , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[48]  Martial Hebert,et al.  Volumetric Features for Video Event Detection , 2010, International Journal of Computer Vision.