Retrieval of Multimedia Objects by Fusing Multiple Modalities

Effective multimedia retrieval requires the combination of the heterogeneous media contained within multimedia objects and the features that can be extracted from them. To this end, we extend a unifying framework that integrates all well-known weighted, graph-based, and diffusion-based fusion techniques that combine two modalities (textual and visual similarities) to model the fusion of multiple modalities. We also provide a theoretical formula for the optimal number of documents that need to be initially selected, so that the memory cost in the case of multiple modalities remains the same as in the case of two modalities. Experiments using two test collections and three modalities (similarities based on visual descriptors, visual concepts, and textual concepts) indicate improvements in the effectiveness over bimodal fusion under the same memory complexity.

[1]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Mohan S. Kankanhalli,et al.  Multimodal fusion for multimedia analysis: a survey , 2010, Multimedia Systems.

[3]  S SawhneyHarpreet,et al.  Efficient Color Histogram Indexing for Quadratic Form Distance Functions , 1995 .

[4]  Gabriela Csurka,et al.  Unsupervised Visual and Textual Information Fusion in CBMIR Using Graph-Based Methods , 2015, TOIS.

[5]  Xiaojun Chang,et al.  Incremental Multimodal Query Construction for Video Search , 2015, ICMR.

[6]  Larry S. Davis,et al.  Multi-Modal Image Retrieval for Complex Queries using Small Codes , 2014, ICMR.

[7]  Jian Wang,et al.  Image-Text Cross-Modal Retrieval via Modality-Specific Feature Learning , 2015, ICMR.

[8]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[9]  James Lee Hafner,et al.  Efficient Color Histogram Indexing for Quadratic Form Distance Functions , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[10]  Roger Levy,et al.  On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Gabriela Csurka,et al.  Comparison of Several Combinations of Multimodal and Diversity Seeking Methods for Multimedia Retrieval , 2009, CLEF.

[12]  Benoit Huet,et al.  When textual and visual information join forces for multimedia retrieval , 2014, ICMR.

[13]  GeversTheo,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010 .

[14]  Shih-Fu Chang,et al.  Video search reranking through random walk over document-level context graph , 2007, ACM Multimedia.

[15]  Georges Quénot,et al.  Re-ranking by local re-scoring for video indexing and retrieval , 2011, CIKM '11.

[16]  Yang Wang,et al.  Towards metric fusion on multi-view data: a cross-view based graph random walk approach , 2013, CIKM.