Modal-adversarial Semantic Learning Network for Extendable Cross-modal Retrieval

Cross-modal retrieval, e.g., using an image query to search for related text and vice versa, has become a prominent research topic, as it provides a flexible retrieval experience across multi-modal data. Existing approaches usually consider the so-called non-extendable cross-modal retrieval task: they learn a common latent subspace from a source set containing labeled image-text pairs and then generate common representations for the instances in a target set to perform cross-modal matching. However, these methods may not generalize well when the target set contains unseen classes, since the non-extendable task assumes that the instances of the source and target sets share the same range of classes. In this paper, we consider the more practical extendable cross-modal retrieval task, where the instances in the source and target sets have disjoint classes. We propose a novel framework, termed Modal-adversarial Semantic Learning Network (MASLN), to tackle the limitations of existing methods on this task. Specifically, the proposed MASLN consists of two subnetworks: cross-modal reconstruction and modal-adversarial semantic learning. The former minimizes the cross-modal distribution discrepancy by reconstructing each modality from the other, guided by class embeddings as side information in the reconstruction procedure. The latter generates semantic representations that are indiscriminative across modalities, while a modality classifier tries to distinguish the modalities from the common representation via an adversarial learning mechanism. The two subnetworks are trained jointly to enhance cross-modal semantic consistency in the learned common subspace and knowledge transfer to instances in the target set. Comprehensive experiments on three widely used multi-modal datasets show the effectiveness and robustness of MASLN on both the non-extendable and extendable cross-modal retrieval tasks.
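To make the two-subnetwork design concrete, below is a minimal PyTorch sketch under stated assumptions: the layer sizes, feature dimensions, class names, and loss weights are illustrative placeholders, and the adversarial mechanism is realized with a Ganin-and-Lempitsky-style gradient reversal layer, which is one common way to implement the described min-max game; the paper's exact architecture and objective may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity on the forward pass, negated (scaled)
    gradient on the backward pass. The discriminator learns to tell the
    modalities apart while the encoders, receiving the reversed gradient,
    learn modality-indiscriminative codes."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

class MASLNSketch(nn.Module):
    """Hypothetical reading of the two MASLN subnetworks; all dimensions
    and layer choices here are assumptions, not the paper's specification."""
    def __init__(self, img_dim=4096, txt_dim=300, cls_dim=300, common_dim=256):
        super().__init__()
        # Modality-specific encoders mapping into the common subspace.
        self.img_enc = nn.Sequential(nn.Linear(img_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, common_dim))
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, 1024), nn.ReLU(),
                                     nn.Linear(1024, common_dim))
        # Cross-modal decoders conditioned on a class embedding as side
        # information: the image code must reconstruct the text and vice versa.
        self.txt_dec = nn.Linear(common_dim + cls_dim, txt_dim)
        self.img_dec = nn.Linear(common_dim + cls_dim, img_dim)
        # Modality discriminator over common codes (0 = image, 1 = text).
        self.disc = nn.Sequential(nn.Linear(common_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 2))

    def forward(self, img, txt, cls_emb, lam=1.0):
        zi, zt = self.img_enc(img), self.txt_enc(txt)
        rec_txt = self.txt_dec(torch.cat([zi, cls_emb], dim=1))  # image -> text
        rec_img = self.img_dec(torch.cat([zt, cls_emb], dim=1))  # text -> image
        logits = self.disc(grad_reverse(torch.cat([zi, zt], dim=0), lam))
        return zi, zt, rec_img, rec_txt, logits

# Joint training step on a toy batch (random tensors stand in for CNN
# image features, text vectors, and class-name embeddings).
model = MASLNSketch()
img, txt = torch.randn(8, 4096), torch.randn(8, 300)
cls_emb = torch.randn(8, 300)
mod_labels = torch.cat([torch.zeros(8), torch.ones(8)]).long()
zi, zt, rec_img, rec_txt, logits = model(img, txt, cls_emb)
loss = (F.mse_loss(rec_img, img) + F.mse_loss(rec_txt, txt)  # reconstruction
        + F.cross_entropy(logits, mod_labels))                # adversarial
loss.backward()
```

In this sketch, the mutual reconstruction losses pull the two modality distributions together (each common code must be able to rebuild the other modality, with the class embedding supplying semantic side information), while the reversed gradient from the modality classifier pushes the encoders toward modality-indiscriminative codes; at retrieval time only the encoders would be used to embed queries and candidates into the common subspace.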
