Evaluating multimedia features and fusion for example-based event detection

G. K. Myers (corresponding author) · R. Nallapati · J. van Hout · S. Pancoast
SRI International (SRI), 333 Ravenswood Avenue, Menlo Park, CA 94025, USA
e-mail: gregory.myers@sri.com

R. Nevatia · C. Sun
Institute for Robotics and Intelligent Systems, University of Southern California (USC), Los Angeles, CA 90089-0273, USA

A. Habibian · D. C. Koelma · K. E. A. van de Sande · A. W. M. Smeulders · C. G. M. Snoek
University of Amsterdam (UvA), Science Park 904, P.O. Box 94323, Amsterdam 1098 GH, The Netherlands

R. Nallapati
IBM Thomas J Watson Research Center, 1101 Kitchawan Rd, Yorktown Heights, NY 10598, USA

Multimedia event detection (MED) is a challenging problem because of the heterogeneous content and variable quality found in large collections of Internet videos. To study the value of multimedia features and fusion for representing and learning events from a set of example video clips, we created SESAME, a system for video SEarch with Speed and Accuracy for Multimedia Events. SESAME includes multiple bag-of-words event classifiers based on single data types: low-level visual, motion, and audio features; high-level semantic visual concepts; and automatic speech recognition. Event detection performance was evaluated for each event classifier. The performance of low-level visual and motion features was improved by the use of difference coding. The accuracy of the visual concepts was nearly as strong as that of the low-level visual features. Experiments with a number of fusion methods for combining the event detection scores from these classifiers revealed that simple fusion methods, such as arithmetic mean, perform as well as or better than other, more complex fusion methods. SESAME’s performance in the 2012 TRECVID MED evaluation was one of the best reported.
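As an illustration of the arithmetic-mean fusion mentioned above, the following is a minimal sketch and not SESAME's actual implementation. It assumes each single-data-type classifier has already produced one detection score per clip on a comparable scale; the function name `fuse_mean`, the classifier labels, the clip identifiers, and the score values are all hypothetical.

```python
from statistics import fmean


def fuse_mean(scores_by_classifier):
    """Late fusion: average each clip's scores across all classifiers.

    scores_by_classifier maps a classifier label to a dict of
    {clip_id: detection score}. Only clips scored by every classifier
    are fused (an assumption made here for simplicity).
    """
    per_classifier = list(scores_by_classifier.values())
    common_clips = set.intersection(*(set(s) for s in per_classifier))
    return {
        clip: fmean([s[clip] for s in per_classifier])
        for clip in sorted(common_clips)
    }


# Hypothetical scores for two clips from three single-data-type classifiers.
scores = {
    "visual": {"clip_a": 0.5, "clip_b": 0.25},
    "motion": {"clip_a": 0.75, "clip_b": 0.25},
    "audio": {"clip_a": 0.25, "clip_b": 0.25},
}

fused = fuse_mean(scores)
# Clips can then be ranked by fused score, highest first.
ranking = sorted(fused, key=fused.get, reverse=True)
```

The sketch weights every classifier equally, which is what distinguishes the arithmetic mean from fusion schemes that learn per-classifier weights.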
