GENIE TRECVID 2011 Multimedia Event Detection: Late-Fusion Approaches to Combine Multiple Audio-Visual features

Abstract: For the TRECVID 2011 MED task, the GENIE system incorporated two late-fusion approaches in which multiple discriminative base classifiers are built per feature and then combined through discriminative fusion techniques. All of our fusion and base classifiers are formulated as one-vs-all detectors per event class, with thresholds estimated during cross-validation. A total of five types of features, covering both audio and visual content, were extracted from the data: HOG3D, Object Bank, Gist, MFCC, and acoustic segment models (ASMs). Features such as HOG3D and MFCC are low-level, while Object Bank and ASMs are more semantic. In our work, event-specific feature adaptations and manual annotations were deliberately avoided in order to establish a strong baseline. Overall, the results were competitive in the MED11 evaluation and show that standard machine learning techniques can yield fairly good results even on a challenging dataset.
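To make the late-fusion idea in the abstract concrete, the following is a minimal sketch for a single event class, not the GENIE implementation. It assumes plain logistic regression as a stand-in for both the base classifiers and the fusion classifier, and it picks the detection threshold by maximizing F1 on the fused cross-validation scores; the abstract specifies neither choice. All function names (`train_logistic`, `late_fusion_detector`, `detect`, and so on) are hypothetical.

```python
import numpy as np

def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit a binary logistic-regression detector by batch gradient descent."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def score(w, X):
    """Return detection scores in (0, 1) for the rows of X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def pick_threshold(scores, y):
    """Pick the score threshold with the best F1 (an assumed criterion)."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (y == 1))
        fp = np.sum(pred & (y == 0))
        fn = np.sum(~pred & (y == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t

def late_fusion_detector(features, y, n_folds=5, seed=0):
    """Train a one-vs-all detector for one event.

    features: dict mapping feature name -> (n_samples, dim) array
    y:        array of 0/1 labels (1 = clip contains the event)
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    names = sorted(features)
    # Out-of-fold base scores: each clip is scored by base classifiers
    # that never saw it during training.
    oof = np.zeros((n, len(names)))
    for held in folds:
        train = np.setdiff1d(np.arange(n), held)
        for j, name in enumerate(names):
            w = train_logistic(features[name][train], y[train])
            oof[held, j] = score(w, features[name][held])
    # Fusion classifier: trained on the stacked per-feature scores.
    fusion_w = train_logistic(oof, y)
    # Simplification: the threshold is chosen on the same scores the
    # fusion classifier was fit to, not on a separate held-out split.
    threshold = pick_threshold(score(fusion_w, oof), y)
    # Final base classifiers are refit on all of the training data.
    base_w = {name: train_logistic(features[name], y) for name in names}
    return base_w, fusion_w, threshold, names

def detect(features, base_w, fusion_w, threshold, names):
    """Apply the base classifiers, fuse their scores, and threshold."""
    stacked = np.column_stack(
        [score(base_w[name], features[name]) for name in names]
    )
    return score(fusion_w, stacked) >= threshold
```

Under these assumptions, a full system would call `late_fusion_detector` once per event class, passing one array per feature type (for example HOG3D, Gist, and MFCC) with labels marking that event as positive and everything else as negative.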
