Learning to detect video events from zero or very few video examples

In this work we deal with the problem of high-level event detection in video. Specifically, we study two challenging problems: (i) learning to detect video events solely from a textual description of the event, without using any positive video examples, and (ii) additionally exploiting very few positive training samples together with a small number of "related" videos. For learning only from an event's textual description, we first identify a general learning framework and then study the impact of different design choices at the various stages of this framework. For additionally learning from example videos, when true positive training samples are scarce, we employ an extension of the Support Vector Machine that allows us to exploit "related" event videos by automatically introducing different weights for subsets of the videos in the overall training set. Experimental evaluations performed on the large-scale TRECVID MED 2014 video dataset provide insight into the effectiveness of the proposed methods.

We deal with the challenging problem of high-level event detection in video. We build event detectors based solely on textual descriptions of the event classes. We also learn event detectors from very few positive and related training samples. We present results and comparisons on a large-scale TRECVID MED video dataset.
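To make the idea of weighting subsets of the training set more concrete, the following is a minimal sketch of a linear SVM trained with per-sample weights, so that "related" videos contribute less to the hinge loss than true positives. It is not the SVM extension used in this work, which introduces the weights automatically; here the weights are fixed by hand. The function names, the weight value 0.3, and the two-dimensional toy features are all assumptions made purely for illustration.

```python
from typing import List, Sequence, Tuple


def train_weighted_linear_svm(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    sample_weights: Sequence[float],
    lam: float = 0.01,
    epochs: int = 200,
    lr: float = 0.1,
) -> Tuple[List[float], float]:
    """Full-batch subgradient descent on a weighted hinge loss.

    Objective (illustrative, not the formulation of this work):
        lam/2 * ||w||^2 + (1/n) * sum_i c_i * max(0, 1 - y_i * (w.x_i + b))
    where c_i is the hand-chosen weight of sample i.
    """
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi, ci in zip(X, y, sample_weights):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # sample violates the margin
                for j in range(dim):
                    grad_w[j] -= ci * yi * xi[j] / n
                grad_b -= ci * yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def decision(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed score of x; positive means 'event detected'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


# Toy data (assumed): two true positives, one "related" video, three negatives.
X = [[2.0, 2.0], [3.0, 2.0], [1.5, 1.0], [-2.0, -2.0], [-3.0, -1.0], [-1.0, -2.0]]
y = [1, 1, 1, -1, -1, -1]
# The related video is treated as a down-weighted positive (0.3 is an assumption).
sample_weights = [1.0, 1.0, 0.3, 1.0, 1.0, 1.0]

w, b = train_weighted_linear_svm(X, y, sample_weights)
```

Calling `decision(w, b, video_features)` then scores an unseen video. In this sketch a lower weight simply makes margin violations on related videos cheaper, so they shape the decision boundary less than the scarce true positives do.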
