Semi-supervised Coupled Dictionary Learning for Cross-modal Retrieval in Internet Images and Texts

With massive amounts of images and texts emerging on the Internet, there is a growing demand for effective cross-modal retrieval such as text-to-image and image-to-text search. To bridge the heterogeneity between the image and text modalities, existing subspace learning methods learn a common latent subspace in which cross-modal matching can be performed. However, these methods usually require fully paired samples (images with corresponding texts) and ignore the class label information that accompanies the paired samples. This may prevent them from learning an effective subspace, since the correlations between the two modalities are only implicitly incorporated. Class label information can reduce the semantic gap between the modalities and explicitly guide the subspace learning procedure. In addition, the large quantities of unpaired samples (images or texts alone) can provide useful side information to enrich the representations learned in the subspace. In this paper we therefore propose a novel model for the cross-modal retrieval problem. It consists of 1) a semi-supervised coupled dictionary learning step that generates homogeneous sparse representations for the different modalities from both paired and unpaired samples, and 2) a coupled feature mapping step that projects the sparse representations of the different modalities into a common subspace defined by the class label information, where cross-modal matching is performed. Experiments on the large-scale web image dataset MIRFlickr-1M, under both fully paired and partially unpaired settings, demonstrate the effectiveness of the proposed model on the cross-modal retrieval task.
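
As a rough illustration of the two-step pipeline described above, the Python sketch below implements a simplified, paired-only variant: a shared dictionary learned over concatenated image/text features stands in for the coupled dictionary learning step, and regularized linear maps from the per-modality sparse codes onto one-hot class labels stand in for the coupled feature mapping step. The semi-supervised use of unpaired samples is omitted, and the data, dimensions, and scikit-learn routines are illustrative assumptions rather than the authors' exact formulation.

# Minimal sketch of the two-step model (assumptions noted above, not the
# authors' exact algorithm): coupled dictionaries via a joint dictionary on
# concatenated paired features, then label-guided linear maps for matching.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy paired data: n image/text pairs with class labels (one-hot).
n, d_img, d_txt, n_classes, n_atoms = 200, 64, 32, 5, 40
X_img = rng.standard_normal((n, d_img))               # image features
X_txt = rng.standard_normal((n, d_txt))               # text features
Y = np.eye(n_classes)[rng.integers(0, n_classes, n)]  # class labels

# Step 1 (coupled dictionary learning, simplified): learn one dictionary on
# the concatenated paired features so both modalities share sparse codes,
# then split it into per-modality dictionaries.
joint = MiniBatchDictionaryLearning(n_components=n_atoms, alpha=1.0,
                                    random_state=0).fit(np.hstack([X_img, X_txt]))
D_img = joint.components_[:, :d_img]                  # image dictionary
D_txt = joint.components_[:, d_img:]                  # text dictionary

# Sparse codes computed per modality (at test time only one modality is given).
A_img = sparse_encode(X_img, D_img, alpha=1.0)
A_txt = sparse_encode(X_txt, D_txt, alpha=1.0)

# Step 2 (coupled feature mapping): project each modality's sparse codes into
# the common subspace defined by the class labels via regularized linear maps.
map_img = Ridge(alpha=0.1).fit(A_img, Y)
map_txt = Ridge(alpha=0.1).fit(A_txt, Y)

def embed_text(x_txt):
    # Text -> sparse code -> label-defined common subspace.
    return map_txt.predict(sparse_encode(x_txt, D_txt, alpha=1.0))

def embed_image(x_img):
    # Image -> sparse code -> label-defined common subspace.
    return map_img.predict(sparse_encode(x_img, D_img, alpha=1.0))

# Text-to-image retrieval: embed one text query and the image gallery,
# then rank images by cosine similarity in the common subspace.
q = embed_text(X_txt[:1])
G = embed_image(X_img)
scores = (G @ q.T).ravel() / (np.linalg.norm(G, axis=1) * np.linalg.norm(q) + 1e-12)
print("top-5 retrieved image indices:", np.argsort(-scores)[:5])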
