Fusion methods for multi-modal indexing of web data

Effective indexing of multimedia documents requires a multimodal approach in which either the most appropriate modality is selected or several modalities are used collaboratively. A collaborative pattern is a model of combination between media that defines how and when to combine information coming from different media sources. Fusing information from different media is a natural way to handle multimedia content. We focus on fusion strategies in which the indexing task is achieved through the combined use of several modalities. We survey state-of-the-art multimodal fusion techniques, ranging from naive combinations of modalities to more complex machine-learning methods, and discuss the issues that arise when fusing modalities with different properties in the context of semantic indexing.
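As a minimal sketch of the "naive combination" end of this spectrum, the following illustrates weighted late fusion: each modality (e.g. visual, audio, text) produces its own relevance scores for a set of items, and the fused score is their weighted average. This is an illustrative example, not a specific method from the surveyed literature; the function name and weights are hypothetical.

```python
import numpy as np

def late_fusion(modality_scores, weights=None):
    """Fuse per-modality scores by weighted averaging (naive late fusion).

    modality_scores: array-like of shape (n_modalities, n_items),
        one row of scores per modality.
    weights: optional per-modality weights; defaults to uniform.
    """
    scores = np.asarray(modality_scores, dtype=float)
    if weights is None:
        weights = np.full(scores.shape[0], 1.0 / scores.shape[0])
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so weights sum to 1
    return weights @ scores  # weighted average across modalities

# Hypothetical scores for three items from three modalities.
visual = [0.9, 0.2, 0.4]
audio = [0.7, 0.1, 0.8]
text = [0.8, 0.3, 0.5]
fused = late_fusion([visual, audio, text], weights=[0.5, 0.2, 0.3])
```

More sophisticated schemes replace the fixed weights with learned, query-dependent, or concept-dependent ones, which is where machine-learning-based fusion methods come in.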
