Objects2action: Classifying and Localizing Actions without Any Video Example

The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches, we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to enable transfer from seen to unseen classes. Our key contribution is objects2action, a semantic word embedding spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. Our semantic embedding has three main characteristics to accommodate the specifics of actions. First, we propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, we incorporate the automated selection of the most responsive objects per action. Finally, we demonstrate how to extend our zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of our approach.
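To make the mechanism concrete, below is a minimal sketch of zero-shot action scoring in the spirit of the abstract: a video is embedded as a convex combination of object-label word embeddings weighted by object-classifier probabilities, and matched against action-label embeddings in the same skip-gram word space. This is an illustrative reading, not the paper's exact formulation; `word_vectors` is assumed to be a dict-like mapping from word to vector (e.g., a pretrained word2vec model), and `object_probs` stands in for per-video object classifier outputs.

```python
import numpy as np

def embed_label(label, word_vectors):
    """Average the word vectors of a (possibly multi-word) label,
    e.g. "horse riding" -> mean of vec("horse") and vec("riding")."""
    vecs = [word_vectors[w] for w in label.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

def video_embedding(object_probs, object_names, word_vectors, top_k=100):
    """Convex combination of the top-k most responsive objects' embeddings."""
    top = np.argsort(object_probs)[::-1][:top_k]
    weights = object_probs[top] / object_probs[top].sum()  # convex weights
    vecs = np.stack([embed_label(object_names[i], word_vectors) for i in top])
    return weights @ vecs

def classify_action(object_probs, object_names, action_labels, word_vectors):
    """Assign the unseen video to the nearest action label in embedding space."""
    v = video_embedding(np.asarray(object_probs), object_names, word_vectors)
    best, best_sim = None, -np.inf
    for action in action_labels:
        a = embed_label(action, word_vectors)
        # Cosine affinity between the video and action embeddings.
        sim = v @ a / (np.linalg.norm(v) * np.linalg.norm(a) + 1e-8)
        if sim > best_sim:
            best, best_sim = action, sim
    return best
```

The top-k truncation mirrors the abstract's automated selection of the most responsive objects per action, and the word-averaging in `embed_label` mirrors its handling of multiple-word descriptions.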
