Multimodal knowledge-based analysis in multimedia event detection

Multimedia Event Detection (MED) is a multimedia retrieval task whose goal is to find videos of a particular event in a large-scale Internet video archive, given example videos and text descriptions. We focus on multimodal knowledge-based analysis in MED, where we utilize meaningful semantic features such as Automatic Speech Recognition (ASR) transcripts, acoustic concept indexing (42 acoustic concepts), and visual semantic indexing (346 visual concepts) to characterize the videos in the archive. We study two scenarios, in which we either do or do not use the provided example videos. In the former, we propose a novel Adaptive Semantic Similarity (ASS) to measure textual similarity between the ASR transcripts of videos. We also incorporate acoustic concept indexing and classification to retrieve test videos, especially those with too few spoken words. In the latter 'ad-hoc' scenario, where no example videos are available, we use only the event-kit description to retrieve test videos via their ASR transcripts and visual semantics. We also propose an event-specific fusion scheme to combine the textual and visual retrieval outputs. Our results show the effectiveness of the proposed ASS and acoustic concept indexing methods and their complementary roles. We also conduct a set of experiments to assess the proposed framework in the 'ad-hoc' scenario.
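The abstract does not specify the internals of ASS or of the event-specific fusion scheme. As a rough illustration only, the sketch below computes a plain TF-IDF cosine similarity between tokenized ASR transcripts and a weighted late fusion of textual and visual retrieval scores; the function names, the smoothed-IDF formula, and the per-event weight are assumptions for this sketch, not the paper's actual method.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized transcript to a sparse TF-IDF vector (dict)."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({
            # smoothed IDF so terms present in every transcript keep weight
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def fuse(text_score, visual_score, w_text=0.5):
    """Weighted late fusion; w_text would be tuned per event (hypothetical)."""
    return w_text * text_score + (1.0 - w_text) * visual_score
```

In a retrieval setting one would score each test transcript against the query (or the example videos' transcripts), do the same on the visual-concept side, and then combine the two ranked score lists with `fuse`, choosing `w_text` per event, e.g. higher for speech-heavy events.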
