Performance evaluation of early and late fusion methods for generic semantics indexing

This paper focuses on the comparison between two fusion methods, namely early fusion and late fusion. The former fusion is carried out at kernel level, also known as multiple kernel learning, and in the latter, the modalities are fused through logistic regression at classifier score level. Two kinds of multilayer fusion structures, differing in the quantities of feature/kernel groups in a lower fusion layer, are constructed for early and late fusion systems, respectively. The goal of these fusion methods is to put each of various features into effect and mine redundant information of the combination of them, and then to develop a generic and robust semantic indexing system to bridge semantic gap between human concepts and these low-level visual features. Performance evaluated on both TRECVID2009 and TRECVID2010 datasets demonstrates that the systems with our proposed multilayer fusion methods at kernel level perform more stably to reach the goal than the classification-score-level fusion; the most effective and robust one with highest MAP score is constructed by early fusion with two-layer equally weighted composite kernel learning.

[1]  Robert Chen,et al.  FXPAL at TRECVID 2005 , 2005, TRECVID.

[2]  Jing Huang,et al.  Spatial Color Indexing and Applications , 2004, International Journal of Computer Vision.

[3]  Sheng Tang,et al.  TRECVID 2007 High-Level Feature Extraction By MCG-ICT-CAS , 2007, TRECVID.

[4]  Yihong Gong,et al.  Automatic parsing and indexing of news video , 1995, Multimedia Systems.

[5]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[6]  Benoit Huet,et al.  Hierarchical genetic fusion of possibilities , 2005 .

[7]  Stephen Kwek,et al.  Applying Support Vector Machines to Imbalanced Datasets , 2004, ECML.

[8]  Lixin Fan,et al.  Categorizing Nine Visual Classes using Local Appearance Descriptors , 2004 .

[9]  Mark J. F. Gales,et al.  Combining Derivative and Parametric Kernels for Speaker Verification , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Andrew Zisserman,et al.  Scene Classification Using a Hybrid Generative/Discriminative Approach , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Yuan Dong,et al.  The France Telecom Orange Labs (Beijing) Video High-level Feature Extraction Systems –TrecVid 2009 , 2009 .

[12]  Yves Grandvalet,et al.  More efficiency in multiple kernel learning , 2007, ICML '07.

[13]  Alan F. Smeaton,et al.  A Comparison of Score, Rank and Probability-Based Fusion Methods for Video Shot Retrieval , 2005, CIVR.

[14]  A. Murat Tekalp,et al.  A high-performance shot boundary detection algorithm using multiple cues , 1998, Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269).

[15]  Cuesm,et al.  A HIGH PERFORMANCE ALGORITHM FOR SHOT BOUNDARYDETECTION USING MULTIPLE , 2007 .

[16]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[17]  Ramin Zabih,et al.  Comparing images using color coherence vectors , 1997, MULTIMEDIA '96.

[18]  Samy Bengio,et al.  SVMTorch: Support Vector Machines for Large-Scale Regression Problems , 2001, J. Mach. Learn. Res..

[19]  Marcel Worring,et al.  The Semantic Pathfinder: Using an Authoring Metaphor for Generic Multimedia Indexing , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[21]  Yongdong Zhang,et al.  Multimedia Evidence Fusion for Video Concept Detection via OWA Operator , 2009, MMM.

[22]  Gunnar Rätsch,et al.  An introduction to kernel-based learning algorithms , 2001, IEEE Trans. Neural Networks.

[23]  John R. Smith,et al.  IBM Research TRECVID-2009 Video Retrieval System , 2009, TRECVID.

[24]  Shree K. Nayar,et al.  Multiresolution histograms and their use for recognition , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Wolfgang Effelsberg,et al.  On the detection and recognition of television commercials , 1997, Proceedings of IEEE International Conference on Multimedia Computing and Systems.

[26]  Shan Gao,et al.  The France Telecom Orange Labs (Beijing) Video Semantic Indexing Systems - TRECVID 2012 Notebook Paper , 2010, TRECVID.

[27]  Anoop Gupta,et al.  Automatically extracting highlights for TV Baseball programs , 2000, ACM Multimedia.

[28]  Lei Cen,et al.  Fudan University at TRECVID 2008 , 2008, TRECVID.

[29]  Yun Zhai,et al.  University of Central Florida at TRECVID 2006 High-Level Feature Extraction and Video Search , 2006, TRECVID.

[30]  Wessel Kraaij,et al.  TRECVID-2009 high-level feature task: Overview (slides0 , 2005 .

[31]  Dong Wang,et al.  THU and ICRC at TRECVID 2007 , 2007, TRECVID.

[32]  Cordelia Schmid,et al.  Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).