Dual Subspaces with Adversarial Learning for Cross-Modal Retrieval

Learning an effective subspace, such as the image space, the text space, or a shared latent space, in which to measure the correlation between items from different modalities is the core of cross-modal retrieval. However, data from different modalities are imbalanced and complementary: images contain abundant spatial information, while text carries more background and contextual detail. In this paper, we propose a model with dual parallel subspaces (a visual and a textual subspace) to better preserve modality-specific information. Triplet constraints are employed to minimize the semantic gap between cross-modal items that share the same concept, while maximizing the gap between image-text pairs with different concepts, in the corresponding subspace. We then combine adversarial learning with the dual subspaces, formulated as an interplay between two agents. The first agent, the dual subspaces with similarity merging and concept prediction, aims to narrow the gap between the data distributions of the two modalities while keeping concepts invariant, so as to fool the second agent, a modality discriminator that tries to accurately distinguish images from text. Extensive experiments on the Wikipedia and NUS-WIDE-10k datasets verify the effectiveness of the proposed model, which outperforms state-of-the-art methods on cross-modal retrieval tasks.
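
To make the two objectives concrete, the sketch below is a minimal PyTorch-style approximation, not the authors' implementation: the modules `img_encoder`, `txt_encoder`, `concept_clf`, and `modality_disc`, as well as the margin and weighting values, are hypothetical placeholders. It combines per-subspace triplet constraints and concept prediction with an adversarial modality-confusion term for the encoder side.

```python
import torch
import torch.nn.functional as F

def embedding_loss(img_encoder, txt_encoder, concept_clf, modality_disc,
                   img, txt, neg_img, neg_txt, labels,
                   margin=0.2, lam=0.1):
    """Encoder-side loss for one step; the discriminator is trained separately."""
    v, t = img_encoder(img), txt_encoder(txt)                   # matched image-text pair
    v_neg, t_neg = img_encoder(neg_img), txt_encoder(neg_txt)   # concept-different negatives

    # Triplet constraints: pull same-concept image/text together and push
    # concept-different pairs apart, in each of the two subspaces.
    trip = (F.triplet_margin_loss(v, t, t_neg, margin=margin) +
            F.triplet_margin_loss(t, v, v_neg, margin=margin))

    # Concept prediction keeps the embeddings semantically discriminative.
    concept = (F.cross_entropy(concept_clf(v), labels) +
               F.cross_entropy(concept_clf(t), labels))

    # Adversarial term (encoder side): flipped modality labels reward the
    # encoders when the discriminator confuses images with text.
    dv, dt = modality_disc(v), modality_disc(t)
    adv = (F.binary_cross_entropy_with_logits(dv, torch.zeros_like(dv)) +  # image labeled "text"
           F.binary_cross_entropy_with_logits(dt, torch.ones_like(dt)))    # text labeled "image"

    return trip + concept + lam * adv
```

In this reading, the modality discriminator itself would be updated with the true modality labels, alternating with the encoder update as in standard GAN training.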
