Objects2action: Classifying and Localizing Actions without Any Video Example

The goal of this paper is to recognize actions in video without the need for examples. Different from traditional zero-shot approaches, we do not demand the design and specification of attribute classifiers and class-to-attribute mappings to enable transfer from seen to unseen classes. Our key contribution is objects2action, a semantic word embedding spanned by a skip-gram model of thousands of object categories. Action labels are assigned to an object encoding of unseen video based on a convex combination of action and object affinities. Our semantic embedding has three main characteristics to accommodate the specifics of actions. First, we propose a mechanism to exploit multiple-word descriptions of actions and objects. Second, we incorporate the automated selection of the most responsive objects per action. Finally, we demonstrate how to extend our zero-shot approach to the spatio-temporal localization of actions in video. Experiments on four action datasets demonstrate the potential of our approach.
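To make the mechanism concrete, below is a minimal sketch of zero-shot action scoring in the spirit of the abstract: a video is embedded as a convex combination of object-label word embeddings weighted by object-classifier probabilities, and matched against action-label embeddings in the same skip-gram word space. This is an illustrative reading, not the paper's exact formulation; `word_vectors` is assumed to be a dict-like mapping from word to vector (e.g., a pretrained word2vec model), and `object_probs` stands in for per-video object classifier outputs.

```python
import numpy as np

def embed_label(label, word_vectors):
    """Average the word vectors of a (possibly multi-word) label,
    e.g. "horse riding" -> mean of vec("horse") and vec("riding")."""
    vecs = [word_vectors[w] for w in label.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

def video_embedding(object_probs, object_names, word_vectors, top_k=100):
    """Convex combination of the top-k most responsive objects' embeddings."""
    top = np.argsort(object_probs)[::-1][:top_k]
    weights = object_probs[top] / object_probs[top].sum()  # convex weights
    vecs = np.stack([embed_label(object_names[i], word_vectors) for i in top])
    return weights @ vecs

def classify_action(object_probs, object_names, action_labels, word_vectors):
    """Assign the unseen video to the nearest action label in embedding space."""
    v = video_embedding(np.asarray(object_probs), object_names, word_vectors)
    best, best_sim = None, -np.inf
    for action in action_labels:
        a = embed_label(action, word_vectors)
        # Cosine affinity between the video and action embeddings.
        sim = v @ a / (np.linalg.norm(v) * np.linalg.norm(a) + 1e-8)
        if sim > best_sim:
            best, best_sim = action, sim
    return best
```

The top-k truncation mirrors the abstract's automated selection of the most responsive objects per action, and the word-averaging in `embed_label` mirrors its handling of multiple-word descriptions.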
