Multi-modal Correlated Centroid Space for Multi-lingual Cross-Modal Retrieval

We present a novel cross-modal retrieval approach in which the textual modality may appear in different languages. Semantically similar documents are retrieved across modalities and languages using a correlated centroid space unsupervised retrieval (C2SUR) approach. C2SUR consists of two phases. In the first phase, we extract heterogeneous features from a multi-modal document and project them into a correlated space using kernel canonical correlation analysis (KCCA). In the second phase, correlated-space centroids are obtained by clustering and used to retrieve cross-modal documents under different similarity measures. Experimental results show that C2SUR outperforms existing state-of-the-art English cross-modal retrieval approaches and achieves comparable results for other languages.

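As a rough, self-contained illustration of the two-phase pipeline described above, the Python sketch below substitutes scikit-learn's linear CCA for the paper's KCCA, obtains correlated-space centroids with k-means, and ranks candidates by cosine similarity. The feature dimensions, cluster count, and random input data are placeholder assumptions, not the authors' actual configuration.

```python
# Minimal sketch of a C2SUR-style pipeline (assumptions: linear CCA stands in
# for the paper's KCCA; features, data, and parameters are illustrative only).
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Phase 1: heterogeneous features for paired multi-modal documents
# (placeholder random features; real features would come from image
# descriptors and language-specific text representations).
n_docs, d_img, d_txt, d_corr = 200, 128, 300, 10
X_img = rng.standard_normal((n_docs, d_img))   # image-modality features
X_txt = rng.standard_normal((n_docs, d_txt))   # text-modality features

# Project both modalities into a shared correlated space
# (linear CCA here; the paper uses the kernelized variant, KCCA).
cca = CCA(n_components=d_corr, max_iter=1000)
Z_img, Z_txt = cca.fit_transform(X_img, X_txt)

# Phase 2: cluster the correlated space and keep the centroids.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(
    np.vstack([Z_img, Z_txt])
)

def retrieve_text_for_image(q_img, top_k=5):
    """Rank text documents for an image query in the correlated space."""
    z_q = cca.transform(q_img.reshape(1, -1))            # project the query
    q_cluster = kmeans.predict(z_q)[0]                    # nearest centroid
    txt_clusters = kmeans.predict(Z_txt)
    candidates = np.where(txt_clusters == q_cluster)[0]   # same-centroid docs
    if candidates.size == 0:                              # fall back to all docs
        candidates = np.arange(n_docs)
    sims = cosine_similarity(z_q, Z_txt[candidates])[0]   # one similarity choice
    return candidates[np.argsort(-sims)][:top_k]

print(retrieve_text_for_image(X_img[0]))
```

Restricting candidates to documents assigned to the query's nearest centroid reflects how the correlated-space centroids narrow the search; a faithful implementation would replace the linear projection with regularized KCCA and plug in the paper's feature extractors and similarity measures.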