A hybrid graph-based and non-linear late fusion approach for multimedia retrieval

Nowadays, multimedia retrieval has become a task of high importance, due to the need for efficient and fast access to very large and heterogeneous multimedia collections. An interesting challenge within the aforementioned task is the efficient combination of different modalities in a multimedia object and especially the fusion between textual and visual information. The fusion of multiple modalities for retrieval in an unsupervised way has been mostly based on early, weighted linear, graph-based and diffusion-based techniques. In contrast, we present a strategy for fusing textual and visual modalities, through the combination of a non-linear fusion model and a graph-based late fusion approach. The fusion strategy is based on the construction of a uniform multimodal contextual similarity matrix and the non-linear combination of relevance scores from query-based similarity vectors. The proposed late fusion approach is evaluated in the multimedia retrieval task, by applying it to two multimedia collections, namely the WIKI11 and IAPR-TC12. The experimental results indicate its superiority over the baseline method in terms of Mean Average Precision for both considered datasets.

[1]  F. Perronnin,et al.  XRCE ’ s participation to ImagEval , 2007 .

[2]  Pablo N. Mendes,et al.  Improving efficiency and accuracy in multilingual entity extraction , 2013, I-SEMANTICS '13.

[3]  Theodora Tsikrika,et al.  The Wikipedia Image Retrieval Task , 2010, ImageCLEF.

[4]  James Lee Hafner,et al.  Efficient Color Histogram Indexing for Quadratic Form Distance Functions , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[5]  Roger Levy,et al.  On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Mohan S. Kankanhalli,et al.  Multimodal fusion for multimedia analysis: a survey , 2010, Multimedia Systems.

[7]  Gabriela Csurka,et al.  Unsupervised Visual and Textual Information Fusion in CBMIR Using Graph-Based Methods , 2015, TOIS.

[8]  Paul Clough,et al.  The IAPR TC-12 Benchmark: A New Evaluation Resource for Visual Information Systems , 2006 .

[9]  Kuldip K. Paliwal,et al.  Identity verification using speech and face information , 2004, Digit. Signal Process..

[10]  Georges Quénot,et al.  Re-ranking by local re-scoring for video indexing and retrieval , 2011, CIKM '11.

[11]  Xiaojun Chang,et al.  Incremental Multimodal Query Construction for Video Search , 2015, ICMR.

[12]  Larry S. Davis,et al.  Multi-Modal Image Retrieval for Complex Queries using Small Codes , 2014, ICMR.

[13]  Qi Tian,et al.  Multimedia search reranking: A literature survey , 2014, CSUR.

[14]  Michael I. Jordan,et al.  Modeling annotated data , 2003, SIGIR.

[15]  Yueting Zhuang,et al.  Multi-modal Mutual Topic Reinforce Modeling for Cross-media Retrieval , 2014, ACM Multimedia.

[16]  Shih-Fu Chang,et al.  Video search reranking through random walk over document-level context graph , 2007, ACM Multimedia.

[17]  Yiannis Kompatsiaris,et al.  A Unified Model for Socially Interconnected Multimedia-Enriched Objects , 2015, MMM.

[18]  Gabriela Csurka,et al.  XRCE's Participation in Wikipedia Retrieval, Medical Image Modality Classification and Ad-hoc Retrieval Tasks of ImageCLEF 2010 , 2010, CLEF.

[19]  Koen E. A. van de Sande,et al.  Evaluating Color Descriptors for Object and Scene Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Jian Wang,et al.  Image-Text Cross-Modal Retrieval via Modality-Specific Feature Learning , 2015, ICMR.

[21]  Yang Wang,et al.  Towards metric fusion on multi-view data: a cross-view based graph random walk approach , 2013, CIKM.

[22]  Hal Daumé,et al.  A Co-training Approach for Multi-view Spectral Clustering , 2011, ICML.

[23]  Benoit Huet,et al.  When textual and visual information join forces for multimedia retrieval , 2014, ICMR.

[24]  Roger Levy,et al.  A new approach to cross-modal multimedia retrieval , 2010, ACM Multimedia.

[25]  Cordelia Schmid,et al.  Aggregating local descriptors into a compact image representation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[26]  S. M. García,et al.  2014: , 2020, A Party for Lazarus.

[27]  R. Manmatha,et al.  Automatic image annotation and retrieval using cross-media relevance models , 2003, SIGIR.