Efficient Cross-modal Retrieval via Discriminative Deep Correspondence Model

Cross-modal retrieval has recently drawn much attention due to the widespread availability of multi-modal data, and it generally involves two challenges: how to model the correlations between different modalities, and how to exploit class label information to reduce the heterogeneity gap between them. Most previous work focuses mainly on the first challenge and often ignores the second. In this paper, we propose a discriminative deep correspondence model that addresses both problems. By taking class label information into account, the proposed model seamlessly combines a correspondence autoencoder (Corr-AE) with a supervised correspondence neural network (Super-Corr-NN) for cross-modal matching. The former learns correspondence representations of data from different modalities, while the latter is designed to discriminatively reduce the semantic gap between low-level features and high-level semantic descriptions. Extensive experiments on three public datasets demonstrate the effectiveness of the proposed approach in comparison with state-of-the-art competing methods.
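The abstract describes the architecture only at a high level. As a rough illustration, and not the authors' actual implementation, the PyTorch sketch below shows how a Corr-AE-style reconstruction and correspondence objective might be combined with a supervised classification head that uses class labels; all layer sizes, loss weights, and names (img_dim, txt_dim, alpha, beta) are hypothetical choices for this sketch.

# Hypothetical sketch: correspondence autoencoder plus a supervised
# class-label head, loosely following the two components in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscriminativeCorrAE(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=3000, code_dim=128, n_classes=10):
        super().__init__()
        # Modality-specific encoders mapping into a shared code space.
        self.img_enc = nn.Sequential(nn.Linear(img_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, code_dim))
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, code_dim))
        # Decoders for the reconstruction (autoencoder) part.
        self.img_dec = nn.Sequential(nn.Linear(code_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, img_dim))
        self.txt_dec = nn.Sequential(nn.Linear(code_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, txt_dim))
        # Supervised head: maps the shared code to class labels
        # (a stand-in for the Super-Corr-NN branch).
        self.classifier = nn.Linear(code_dim, n_classes)

    def forward(self, img, txt):
        z_img, z_txt = self.img_enc(img), self.txt_enc(txt)
        return z_img, z_txt, self.img_dec(z_img), self.txt_dec(z_txt)

    def loss(self, img, txt, labels, alpha=0.2, beta=1.0):
        z_img, z_txt, img_rec, txt_rec = self.forward(img, txt)
        # Reconstruction terms keep each code faithful to its own modality.
        rec = F.mse_loss(img_rec, img) + F.mse_loss(txt_rec, txt)
        # Correspondence term pulls paired image/text codes together.
        corr = F.mse_loss(z_img, z_txt)
        # Supervised term uses class labels to make the shared space discriminative.
        sup = F.cross_entropy(self.classifier(z_img), labels) + \
              F.cross_entropy(self.classifier(z_txt), labels)
        return rec + alpha * corr + beta * sup

At retrieval time, one would encode a query with its modality's encoder and rank items from the other modality by distance in the shared code space; the weighting between the correspondence and supervised terms is a tunable trade-off, not a value taken from the paper.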
