Video2vec Embeddings Recognize Events When Examples Are Scarce

This paper aims for event recognition when video examples are scarce or even completely absent. The key in such a challenging setting is a semantic video representation. Rather than building the representation from individual attribute detectors and their annotations, we propose to learn the entire representation from freely available web videos and their descriptions using an embedding between video features and term vectors. In our proposed embedding, which we call Video2vec, the correlations between the words are utilized to learn a more effective representation by optimizing a joint objective balancing descriptiveness and predictability. We show how learning the Video2vec embedding using a multimodal predictability loss, including appearance, motion and audio features, results in a better predictable representation. We also propose an event specific variant of Video2vec to learn a more accurate representation for the words, which are indicative of the event, by introducing a term sensitive descriptiveness loss. Our experiments on three challenging collections of web videos from the NIST TRECVID Multimedia Event Detection and Columbia Consumer Videos datasets demonstrate: i) the advantages of Video2vec over representations using attributes or alternative embeddings, ii) the benefit of fusing video modalities by an embedding over common strategies, iii) the complementarity of term sensitive descriptiveness and multimodal predictability for event recognition. By its ability to improve predictability of present day audio-visual video features, while at the same time maximizing their semantic descriptiveness, Video2vec leads to state-of-the-art accuracy for both few- and zero-example recognition of events in video.

[1]  Gang Hua,et al.  Semantic Model Vectors for Complex Video Event Recognition , 2012, IEEE Transactions on Multimedia.

[2]  Mubarak Shah,et al.  Recognizing Complex Events Using Large Margin Joint Low-Level Event Model , 2012, ECCV.

[3]  Dong Liu,et al.  Event-Driven Semantic Concept Discovery by Exploiting Weakly Tagged Internet Images , 2014, ICMR.

[4]  Teruko Mitamura,et al.  Zero-Example Event Search using MultiModal Pseudo Relevance Feedback , 2014, ICMR.

[5]  Jiebo Luo,et al.  Utilizing semantic word similarity measures for video retrieval , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Zhigang Ma,et al.  From Concepts to Events: a Progressive Process for Multimedia content Analysis , 2013 .

[7]  Kristen Grauman,et al.  Relative attributes , 2011, 2011 International Conference on Computer Vision.

[8]  Jason Weston,et al.  WSABIE: Scaling Up to Large Vocabulary Image Annotation , 2011, IJCAI.

[9]  Quan Wang,et al.  Regularized latent semantic indexing , 2011, SIGIR.

[10]  Cees Snoek,et al.  Composite Concept Discovery for Zero-Shot Video Event Detection , 2014, ICMR.

[11]  Mubarak Shah,et al.  High-level event recognition in unconstrained videos , 2013, International Journal of Multimedia Information Retrieval.

[12]  Ali Farhadi,et al.  Describing objects by their attributes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Dong Liu,et al.  Encoding Concept Prototypes for Video Event Detection and Summarization , 2015, ICMR.

[14]  Shuang Wu,et al.  Zero-Shot Event Detection Using Multi-modal Fusion of Weakly Supervised Concepts , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Andrew Zisserman,et al.  Learning Visual Attributes , 2007, NIPS.

[16]  Léon Bottou,et al.  Large-Scale Machine Learning with Stochastic Gradient Descent , 2010, COMPSTAT.

[17]  Roger Levy,et al.  On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[19]  Yi Yang,et al.  Fast and Accurate Content-based Semantic Search in 100M Internet Videos , 2015, ACM Multimedia.

[20]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Alexander C. Berg,et al.  Automatic Attribute Discovery and Characterization from Noisy Web Data , 2010, ECCV.

[22]  Ruifan Li,et al.  Cross-modal Retrieval with Correspondence Autoencoder , 2014, ACM Multimedia.

[23]  Kilian Q. Weinberger,et al.  Large Margin Taxonomy Embedding for Document Categorization , 2008, NIPS.

[24]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[25]  Shih-Fu Chang,et al.  Consumer video understanding: a benchmark database and an evaluation of human and machine performance , 2011, ICMR.

[26]  H. Hotelling Relations Between Two Sets of Variates , 1936 .

[27]  Paul Over,et al.  Creating HAVIC: Heterogeneous Audio Visual Internet Collection , 2012, LREC.

[28]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Jean Ponce,et al.  Sparse Modeling for Image and Vision Processing , 2014, Found. Trends Comput. Graph. Vis..

[30]  Nicu Sebe,et al.  Complex Event Detection via Multi-source Video Attributes , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[32]  Afshin Dehghan,et al.  Improving Semantic Concept Detection through the Dictionary of Visually-Distinct Elements , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[34]  A. G. Amitha Perera,et al.  Multimedia event detection with multimodal feature fusion and temporal concept localization , 2013, Machine Vision and Applications.

[35]  Pradipto Das,et al.  Translating related words to videos and back through latent topics , 2013, WSDM.

[36]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[37]  Patrick Bouthemy,et al.  Better Exploiting Motion for Better Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  James Allan,et al.  Zero-shot video retrieval using content and concepts , 2013, CIKM.

[39]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[40]  Mubarak Shah,et al.  Recognition of Complex Events: Exploiting Temporal Dynamics between Underlying Concepts , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[42]  Cordelia Schmid,et al.  Aggregating Local Image Descriptors into Compact Codes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Christoph H. Lampert,et al.  Learning to detect unseen object classes by between-class attribute transfer , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Cees Snoek,et al.  Recommendations for recognizing video events by concept vocabularies , 2014, Comput. Vis. Image Underst..

[45]  Dong Liu,et al.  Building A Large Concept Bank for Representing Events in Video , 2014, ArXiv.

[46]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[47]  Shiguang Shan,et al.  Informedia@TrecVID 2014: MED and MER , 2014 .

[48]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[49]  Cees Snoek,et al.  VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events , 2014, ACM Multimedia.

[50]  Cordelia Schmid,et al.  Label-Embedding for Attribute-Based Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[51]  Subhashini Venugopalan,et al.  Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.

[52]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[53]  Georges Quénot,et al.  TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[54]  Geoffrey E. Hinton,et al.  Zero-shot Learning with Semantic Output Codes , 2009, NIPS.

[55]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Hui Cheng,et al.  Evaluation of low-level features and their combinations for complex event detection in open source videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[57]  Yi Yang,et al.  A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Ramakant Nevatia,et al.  DISCOVER: Discovering Important Segments for Classification of Video Events and Recounting , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[59]  Cees Snoek,et al.  Objects2action: Classifying and Localizing Actions without Any Video Example , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[60]  Dong Liu,et al.  EventNet: A Large Scale Structured Concept Library for Complex Event Detection in Video , 2015, ACM Multimedia.

[61]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[62]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[63]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[64]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Indexing , 1999, SIGIR Forum.

[65]  Larry S. Davis,et al.  VRFP: On-the-Fly Video Retrieval Using Web Images and Fast Fisher Vector Products , 2015, IEEE Transactions on Multimedia.

[66]  Yi Yang,et al.  Semantic Concept Discovery for Large-Scale Zero-Shot Event Detection , 2015, IJCAI.

[67]  Chenliang Xu,et al.  A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[68]  Wei Chen,et al.  Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework , 2015, AAAI.

[69]  Thomas Mensink,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[70]  Nuno Vasconcelos,et al.  Dynamic Pooling for Complex Event Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[71]  Michael I. Jordan,et al.  Modeling annotated data , 2003, SIGIR.

[72]  Shuang Wu,et al.  Multimodal feature fusion for robust event detection in web videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[73]  Dong Liu,et al.  BBNVISER : BBN VISER TRECVID 2012 Multimedia Event Detection and Multimedia Event Recounting Systems , 2012, TRECVID.

[74]  Cees Snoek,et al.  UvA-DARE ( Digital Academic Repository ) Event Fisher Vectors : Robust Encoding Visual Diversity of Visual Streams , 2015 .

[75]  Hagai Attias,et al.  Topic regression multi-modal Latent Dirichlet Allocation for image annotation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[76]  Fei-Fei Li,et al.  Combining the Right Features for Complex Event Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[77]  Mubarak Shah,et al.  Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching , 2010, TRECVID.

[78]  Cordelia Schmid,et al.  A Robust and Efficient Video Representation for Action Recognition , 2015, International Journal of Computer Vision.

[79]  Nicu Sebe,et al.  Complex Event Detection via Event Oriented Dictionary Learning , 2015, AAAI.

[80]  Shiguang Shan,et al.  Self-Paced Curriculum Learning , 2015, AAAI.

[81]  Yi Yang,et al.  Searching Persuasively: Joint Event Detection and Evidence Recounting with Limited Supervision , 2015, ACM Multimedia.

[82]  Fei-Fei Li,et al.  Learning latent temporal structure for complex event detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[83]  Masoud Mazloom,et al.  Conceptlets: Selective Semantics for Classifying Video Events , 2014, IEEE Transactions on Multimedia.