Discriminative key-component models for interaction detection and recognition

We present key-component models for selecting important temporal instants or human poses from video sequences. Human interaction key-component models are learned from weakly supervised data, and we demonstrate empirical results on the VIRAT and UT-Interaction datasets.

Not all frames are equal: selecting a subset of discriminative frames from a video can improve performance at detecting and recognizing human interactions. In this paper we present models for categorizing a video into one of a number of predefined interactions, or for detecting these interactions in a long video sequence. The models represent an interaction by a set of key temporal moments and the spatial structures they entail, for instance two people approaching each other and then extending their hands before engaging in a "handshaking" interaction. Learning the model parameters requires only weak supervision in the form of an overall label for the interaction. Experimental results on the UT-Interaction and VIRAT datasets verify the efficacy of these structured models for human interactions.
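To make the key-component idea concrete, the sketch below scores a clip for one interaction class by selecting K temporally ordered "key" frames that best match K linear templates, and classifies by taking the highest-scoring class. This is a minimal illustration under stated assumptions: the function names (keyframe_score, classify), the per-frame linear templates, and the dynamic program over ordered key frames are hypothetical stand-ins for the general approach described above, not the authors' implementation.

```python
import numpy as np

def keyframe_score(frame_feats, w, K):
    """Illustrative sketch only, not the authors' implementation.

    frame_feats : (T, D) array of per-frame descriptors (e.g. HOG/HOF).
    w           : (K, D) array, one linear template per key component.
    K           : number of key temporal moments in the model.

    Returns the best total score and the chosen frame indices. A dynamic
    program over (frame t, component k) enforces that the K key frames
    appear in temporal order.
    """
    T, _ = frame_feats.shape
    unary = frame_feats @ w.T            # (T, K): response of every frame to every template
    dp = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)

    dp[:, 0] = unary[:, 0]               # the first key component may sit on any frame
    for k in range(1, K):
        best_prev, best_idx = -np.inf, 0
        for t in range(T):
            # best placement of component k-1 strictly before frame t
            if t > 0 and dp[t - 1, k - 1] > best_prev:
                best_prev, best_idx = dp[t - 1, k - 1], t - 1
            dp[t, k] = best_prev + unary[t, k]
            back[t, k] = best_idx

    t = int(np.argmax(dp[:, K - 1]))
    score = dp[t, K - 1]
    keyframes = [t]
    for k in range(K - 1, 0, -1):        # trace back the selected key frames
        t = back[t, k]
        keyframes.append(t)
    return score, keyframes[::-1]

def classify(frame_feats, templates, K=4):
    """Pick the interaction class whose key-component model scores highest."""
    scores = {c: keyframe_score(frame_feats, w_c, K)[0] for c, w_c in templates.items()}
    return max(scores, key=scores.get)
```

In a weakly supervised setting such as the one described above, the key-frame indices would act as latent variables: only the clip-level interaction label is given, and the templates would be learned by alternating between inferring the best key frames (as in keyframe_score) and updating the class templates.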
