Multimedia retrieval based on non-linear graph-based fusion and partial least squares regression

Heterogeneous sources of information, such as images, videos, text and metadata, are often used to describe different or complementary views of the same multimedia object, especially in the online news domain and in large annotated image collections. Retrieving multimedia objects for a multimodal query requires combining several sources of information in an efficient and scalable way. To this end, we propose a novel unsupervised framework for the multimodal fusion of visual and textual similarities, which are based on visual features, visual concepts and textual metadata, integrating non-linear graph-based fusion and Partial Least Squares Regression. The fusion strategy is based on the construction of a multimodal contextual similarity matrix and the non-linear combination of relevance scores from query-based similarity vectors. Our framework can employ more than two modalities and high-level information without increasing memory complexity compared to state-of-the-art baseline methods. The experimental comparison is performed on three public multimedia collections in the multimedia retrieval task. The results show that the proposed method outperforms the baseline methods in terms of Mean Average Precision and Precision@20.
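The pipeline the abstract describes can be sketched in code. The following is a minimal illustrative sketch, not the authors' exact formulation: the cross-modal graph step, the multiplicative (non-linear) score combination, the mixing weight `alpha`, and the helper names are all assumptions introduced here for illustration.

```python
import numpy as np

def row_normalize(S):
    # Turn a non-negative similarity matrix into a row-stochastic
    # transition matrix, so that matrix products act as graph walks.
    return S / S.sum(axis=1, keepdims=True)

def graph_fusion(S_visual, S_textual, query_vis, query_txt, alpha=0.5):
    # Hypothetical contextual similarity: each modality's graph is
    # smoothed by a walk over the other modality's graph.
    Pv, Pt = row_normalize(S_visual), row_normalize(S_textual)
    context = alpha * Pv @ Pt + (1 - alpha) * Pt @ Pv
    # Query-based relevance vectors propagated over the fused graph.
    rel_vis = query_vis @ context
    rel_txt = query_txt @ context
    # Non-linear (here: multiplicative) combination of relevance scores.
    return rel_vis * rel_txt

def pls1_direction(X, y):
    # First weight vector of single-response Partial Least Squares (PLS1):
    # the direction in feature space maximizing covariance with y. In the
    # described framework, PLSR could relate one modality's features to
    # another modality's relevance scores.
    w = X.T @ y
    return w / np.linalg.norm(w)

# Toy example: 4 items with random positive pairwise similarities.
rng = np.random.default_rng(0)
Sv = rng.random((4, 4)) + 1e-6   # visual similarity matrix
St = rng.random((4, 4)) + 1e-6   # textual similarity matrix
qv = rng.random(4)               # visual query-to-item similarities
qt = rng.random(4)               # textual query-to-item similarities
scores = graph_fusion(Sv, St, qv, qt)
ranking = np.argsort(-scores)    # items ordered by fused relevance
```

Because the combination is multiplicative, an item must score reasonably in both modalities to rank highly, which is one plausible reading of "non-linear combination of relevance scores"; a linear weighted sum would instead let a single strong modality dominate.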
