Leveraging high-level and low-level features for multimedia event detection

This paper addresses the challenge of Multimedia Event Detection by proposing a novel method for high-level and low-level features fusion based on collective classification. Generally, the method consists of three steps: training a classifier from low-level features; encoding high-level features into graphs; and diffusing the scores on the established graph to obtain the final prediction. The final prediction is derived from multiple graphs each of which corresponds to a high-level feature. The paper investigates two graph construction methods using logarithmic and exponential loss functions, respectively and two collective classification algorithms, i.e. Gibbs sampling and Markov random walk. The theoretical analysis demonstrates that the proposed method converges and is computationally scalable and the empirical analysis on TRECVID 2011 Multimedia Event Detection dataset validates its outstanding performance compared to state-of-the-art methods, with an added benefit of interpretability.

[1]  Cor J. Veenman,et al.  Kernel Codebooks for Scene Categorization , 2008, ECCV.

[2]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[3]  Hugo Jair Escalante,et al.  Late fusion of heterogeneous methods for multimedia image retrieval , 2008, MIR '08.

[4]  Roger Levy,et al.  A new approach to cross-modal multimedia retrieval , 2010, ACM Multimedia.

[5]  Hao Su,et al.  Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification , 2010, NIPS.

[6]  Edward Y. Chang,et al.  Optimal multimodal fusion for multimedia data analysis , 2004, MULTIMEDIA '04.

[7]  Kalyan Moy Gupta,et al.  Cautious Inference in Collective Classification , 2007, AAAI.

[8]  Lise Getoor,et al.  Collective Classification in Network Data , 2008, AI Mag..

[9]  H. Hotelling Relations Between Two Sets of Variates , 1936 .

[10]  Peter Green,et al.  Markov chain Monte Carlo in Practice , 1996 .

[11]  Foster J. Provost,et al.  Classification in Networked Data: a Toolkit and a Univariate Case Study , 2007, J. Mach. Learn. Res..

[12]  Jennifer Neville,et al.  Across-Model Collective Ensemble Classification , 2011, AAAI.

[13]  Andrew Zisserman,et al.  Representing shape with a spatial pyramid kernel , 2007, CIVR '07.

[14]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[15]  Joo-Hwee Lim,et al.  Latent semantic fusion model for image retrieval and annotation , 2007, CIKM '07.

[16]  Stéphane Marchand-Maillet,et al.  Information Fusion in Multimedia Information Retrieval , 2007, Adaptive Multimedia Retrieval.