VideoStory Embeddings Recognize Events when Examples are Scarce

This paper aims for event recognition when video examples are scarce or even completely absent. The key in such a challenging setting is a semantic video representation. Rather than building the representation from individual attribute detectors and their annotations, we propose to learn the entire representation from freely available web videos and their descriptions using an embedding between video features and term vectors. In our proposed embedding, which we call VideoStory, the correlations between the terms are utilized to learn a more effective representation by optimizing a joint objective balancing descriptiveness and predictability. We show how learning the VideoStory using a multimodal predictability loss, including appearance, motion and audio features, results in a better predictable representation. We also propose a variant of VideoStory to recognize an event in video from just the important terms in a text query by introducing a term sensitive descriptiveness loss. Our experiments on three challenging collections of web videos from the NIST TRECVID Multimedia Event Detection and Columbia Consumer Videos datasets demonstrate: i) the advantages of VideoStory over representations using attributes or alternative embeddings, ii) the benefit of fusing video modalities by an embedding over common strategies, iii) the complementarity of term sensitive descriptiveness and multimodal predictability for event recognition without examples. By it abilities to improve predictability upon any underlying video feature while at the same time maximizing semantic descriptiveness, VideoStory leads to state-of-the-art accuracy for both fewand zero-example recognition of events in video.

[1]  Yi Yang,et al.  Fast and Accurate Content-based Semantic Search in 100M Internet Videos , 2015, ACM Multimedia.

[2]  Yi Yang,et al.  Searching Persuasively: Joint Event Detection and Evidence Recounting with Limited Supervision , 2015, ACM Multimedia.

[3]  Yi Yang,et al.  Semantic Concept Discovery for Large-Scale Zero-Shot Event Detection , 2015, IJCAI.

[4]  Dong Liu,et al.  Encoding Concept Prototypes for Video Event Detection and Summarization , 2015, ICMR.

[5]  Dong Liu,et al.  EventNet: A Large Scale Structured Concept Library for Complex Event Detection in Video , 2015, ACM Multimedia.

[6]  Cordelia Schmid,et al.  A Robust and Efficient Video Representation for Action Recognition , 2015, International Journal of Computer Vision.

[7]  Nicu Sebe,et al.  Complex Event Detection via Event Oriented Dictionary Learning , 2015, AAAI.

[8]  Wei Chen,et al.  Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework , 2015, AAAI.

[9]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Yi Yang,et al.  A discriminative CNN video representation for event detection , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Jean Ponce,et al.  Sparse Modeling for Image and Vision Processing , 2014, Found. Trends Comput. Graph. Vis..

[12]  Cees Snoek,et al.  VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events , 2014, ACM Multimedia.

[13]  Ruifan Li,et al.  Cross-modal Retrieval with Correspondence Autoencoder , 2014, ACM Multimedia.

[14]  Masoud Mazloom,et al.  Conceptlets: Selective Semantics for Classifying Video Events , 2014, IEEE Transactions on Multimedia.

[15]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[17]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[18]  Cees Snoek,et al.  Recommendations for recognizing video events by concept vocabularies , 2014, Comput. Vis. Image Underst..

[19]  Shuang Wu,et al.  Zero-Shot Event Detection Using Multi-modal Fusion of Weakly Supervised Concepts , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Mubarak Shah,et al.  Recognition of Complex Events: Exploiting Temporal Dynamics between Underlying Concepts , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Afshin Dehghan,et al.  Improving Semantic Concept Detection through the Dictionary of Visually-Distinct Elements , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Ramakant Nevatia,et al.  DISCOVER: Discovering Important Segments for Classification of Video Events and Recounting , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[25]  Dong Liu,et al.  Event-Driven Semantic Concept Discovery by Exploiting Weakly Tagged Internet Images , 2014, ICMR.

[26]  Cees Snoek,et al.  Composite Concept Discovery for Zero-Shot Video Event Detection , 2014, ICMR.

[27]  Teruko Mitamura,et al.  Zero-Example Event Search using MultiModal Pseudo Relevance Feedback , 2014, ICMR.

[28]  Dong Liu,et al.  Building A Large Concept Bank for Representing Events in Video , 2014, ArXiv.

[29]  Roger Levy,et al.  On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Zhigang Ma,et al.  From Concepts to Events: a Progressive Process for Multimedia content Analysis , 2013 .

[31]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[32]  Cordelia Schmid,et al.  Action and Event Recognition with Fisher Vectors on a Compact Feature Set , 2013, 2013 IEEE International Conference on Computer Vision.

[33]  Nuno Vasconcelos,et al.  Dynamic Pooling for Complex Event Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[34]  Fei-Fei Li,et al.  Combining the Right Features for Complex Event Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[35]  James Allan,et al.  Zero-shot video retrieval using content and concepts , 2013, CIKM.

[36]  Patrick Bouthemy,et al.  Better Exploiting Motion for Better Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Chenliang Xu,et al.  A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Cordelia Schmid,et al.  Label-Embedding for Attribute-Based Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Nicu Sebe,et al.  Complex Event Detection via Multi-source Video Attributes , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  F. Perronnin,et al.  Image Classification with the Fisher Vector: Theory and Practice , 2013, International Journal of Computer Vision.

[41]  Mubarak Shah,et al.  High-level event recognition in unconstrained videos , 2013, International Journal of Multimedia Information Retrieval.

[42]  Pradipto Das,et al.  Translating related words to videos and back through latent topics , 2013, WSDM.

[43]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[44]  Mubarak Shah,et al.  Recognizing Complex Events Using Large Margin Joint Low-Level Event Model , 2012, ECCV.

[45]  Cordelia Schmid,et al.  Aggregating Local Image Descriptors into Compact Codes , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Fei-Fei Li,et al.  Learning latent temporal structure for complex event detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Shuang Wu,et al.  Multimodal feature fusion for robust event detection in web videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Hui Cheng,et al.  Evaluation of low-level features and their combinations for complex event detection in open source videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[49]  Paul Over,et al.  Creating HAVIC: Heterogeneous Audio Visual Internet Collection , 2012, LREC.

[50]  Gang Hua,et al.  Semantic Model Vectors for Complex Video Event Recognition , 2012, IEEE Transactions on Multimedia.

[51]  Georges Quénot,et al.  TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[52]  Kristen Grauman,et al.  Relative attributes , 2011, 2011 International Conference on Computer Vision.

[53]  Quan Wang,et al.  Regularized latent semantic indexing , 2011, SIGIR.

[54]  Jason Weston,et al.  WSABIE: Scaling Up to Large Vocabulary Image Annotation , 2011, IJCAI.

[55]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[56]  Shih-Fu Chang,et al.  Consumer video understanding: a benchmark database and an evaluation of human and machine performance , 2011, ICMR.

[57]  Alexander C. Berg,et al.  Automatic Attribute Discovery and Characterization from Noisy Web Data , 2010, ECCV.

[58]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[59]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[60]  Hagai Attias,et al.  Topic regression multi-modal Latent Dirichlet Allocation for image annotation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[61]  Geoffrey E. Hinton,et al.  Zero-shot Learning with Semantic Output Codes , 2009, NIPS.

[62]  Ali Farhadi,et al.  Describing objects by their attributes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[63]  Christoph H. Lampert,et al.  Learning to detect unseen object classes by between-class attribute transfer , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[64]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[65]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[66]  Andrew Zisserman,et al.  Learning Visual Attributes , 2007, NIPS.

[67]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[68]  Cees G. M. Snoek,et al.  Early versus late fusion in semantic video analysis , 2005, ACM Multimedia.

[69]  Michael I. Jordan,et al.  Modeling annotated data , 2003, SIGIR.

[70]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Indexing , 1999, SIGIR Forum.

[71]  H. Hotelling Relations Between Two Sets of Variates , 1936 .

[72]  Subhashini Venugopalan,et al.  Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.

[73]  G. K. Tam,et al.  Event Fisher Vectors: Robust Encoding Visual Diversity of Visual Streams , 2015 .

[74]  Jason J. Corso,et al.  Multimedia event detection with multimodal feature fusion and temporal concept localization , 2013, Machine Vision and Applications.

[75]  Léon Bottou,et al.  Large-Scale Machine Learning with Stochastic Gradient Descent , 2010, COMPSTAT.

[76]  Mubarak Shah,et al.  Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching , 2010, TRECVID.

[77]  Kilian Q. Weinberger,et al.  Large Margin Taxonomy Embedding for Document Categorization , 2008, NIPS.