E-LAMP: integration of innovative ideas for multimedia event detection

Detecting multimedia events in web videos is an emerging research area in multimedia and computer vision. In this paper, we introduce the core methods and technologies of the framework we recently developed for our Event Labeling through Analytic Media Processing (E-LAMP) system, which addresses different aspects of the overall event detection problem. First, we have developed efficient feature extraction methods so that we can handle large collections comprising thousands of hours of video. Second, we represent the extracted raw features in a spatial bag-of-words model with more effective tilings, so that the spatial layout information of different features and different events is better captured and the overall detection performance improves. Third, in contrast to the widely used early and late fusion schemes, we have developed a novel algorithm that learns a more robust and discriminative intermediate feature representation from multiple features, so that better event models can be built upon it. Finally, to tackle the additional challenge of event detection with only very few positive exemplars, we have developed a novel algorithm that effectively adapts knowledge learnt from auxiliary sources to assist event detection. Both our empirical results and the official evaluation results on TRECVID MED’11 and MED’12 demonstrate the excellent performance of the integration of these ideas.
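To make the spatial bag-of-words idea concrete, the following is a minimal sketch of how quantized local features might be pooled over several tilings of a frame. It is not the E-LAMP implementation: the function name `spatial_bow`, the default tilings (1x1, 2x2, 3x1), and the per-tiling L1 normalization are all assumptions chosen for illustration, since the abstract does not specify which tilings or normalization the system uses.

```python
import numpy as np

def spatial_bow(xy, words, vocab_size, width, height,
                tilings=((1, 1), (2, 2), (3, 1))):
    """Concatenate per-cell visual-word histograms over several tilings.

    xy      : (N, 2) feature positions (x, y) in pixels
    words   : (N,) visual-word index of each feature, in [0, vocab_size)
    tilings : hypothetical (rows, cols) grids; not taken from the paper
    """
    xy = np.asarray(xy, dtype=float).reshape(-1, 2)
    words = np.asarray(words, dtype=int)
    blocks = []
    for rows, cols in tilings:
        # Map each position to a grid cell, clamping points on the far border.
        col = np.minimum((xy[:, 0] / width * cols).astype(int), cols - 1)
        row = np.minimum((xy[:, 1] / height * rows).astype(int), rows - 1)
        cell = row * cols + col
        # One histogram bin per (cell, word) pair.
        flat = cell * vocab_size + words
        hist = np.bincount(flat, minlength=rows * cols * vocab_size).astype(float)
        total = hist.sum()
        if total > 0:
            hist /= total  # L1-normalize each tiling (an assumption)
        blocks.append(hist)
    return np.concatenate(blocks)

# Three features in a 100x100 frame with a two-word vocabulary.
h = spatial_bow([(10, 10), (90, 10), (10, 90)], [0, 1, 1],
                vocab_size=2, width=100, height=100,
                tilings=((1, 1), (2, 2)))
```

The resulting vector has one block per tiling, so a classifier trained on it can weight the same visual word differently depending on where in the frame it occurs.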
