CMPD: Using Cross Memory Network With Pair Discrimination for Image-Text Retrieval

Cross-modal retrieval with deep neural networks aims to retrieve relevant items across different modalities, e.g., retrieving images with a text query. Its performance remains unsatisfactory for two reasons. First, most previous methods fail to incorporate the common knowledge shared among modalities when predicting item representations. Second, the semantic relationships indicated by class labels, an important clue for inferring similarities between cross-modal items, are still insufficiently exploited. To address these issues, we propose a novel cross memory network with pair discrimination (CMPD) for image-text retrieval, whose contributions are two-fold. First, we propose the cross memory, a learnable set of latent concepts that captures the common knowledge shared among modalities; it is fused into each modality through an attention mechanism to predict discriminative representations. Second, we propose a pair discrimination loss that discriminates the modality labels and class labels of item pairs, efficiently capturing the semantic relationships among these labels. Comprehensive experimental results show that our method outperforms state-of-the-art approaches on image-text retrieval.
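
To make the cross-memory idea concrete, below is a minimal PyTorch sketch of a learnable memory bank that both modality encoders attend over. All names (CrossMemory, num_slots, query_proj) and the residual fusion are illustrative assumptions of ours, not the authors' actual implementation; the abstract only specifies a learnable set of latent concepts fused into each modality via attention.

    # Minimal sketch of a shared cross-memory module (assumed design,
    # not the paper's exact architecture).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossMemory(nn.Module):
        """A learnable bank of latent concepts shared by both modalities.

        Image and text features attend over the same memory slots, so the
        retrieved summary injects cross-modal common knowledge into each
        per-modality representation.
        """
        def __init__(self, num_slots: int = 64, dim: int = 512):
            super().__init__()
            # The shared memory: num_slots latent concept vectors.
            self.memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
            self.query_proj = nn.Linear(dim, dim)  # maps a feature to a query

        def forward(self, feat: torch.Tensor) -> torch.Tensor:
            # feat: (batch, dim) image or text feature from its encoder.
            q = self.query_proj(feat)                                   # (batch, dim)
            attn = F.softmax(q @ self.memory.t() / q.shape[-1] ** 0.5,
                             dim=-1)                                    # (batch, num_slots)
            read = attn @ self.memory                                   # (batch, dim)
            # Fuse the retrieved common knowledge back into the feature.
            return feat + read

Because the same memory parameters serve both encoders, gradients from image and text branches jointly shape the latent concepts, which is one way the module could capture knowledge common to the two modalities.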
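The pair discrimination loss can likewise be sketched as a classifier over item pairs. The following is one plausible reading under our own assumptions: each pair receives a joint label combining its modality composition (image-image, image-text, text-text) with whether the two items share a class, and a small head is trained with cross-entropy on that label. The head structure and label scheme are hypothetical, not the paper's formulation.

    # Hedged sketch of a pair-discrimination objective (assumed reading).
    import torch
    import torch.nn as nn

    class PairDiscriminator(nn.Module):
        def __init__(self, dim: int = 512, num_pair_labels: int = 6):
            super().__init__()
            # num_pair_labels: e.g. 3 modality compositions x {same, different class}.
            self.head = nn.Sequential(
                nn.Linear(2 * dim, dim), nn.ReLU(),
                nn.Linear(dim, num_pair_labels),
            )
            self.ce = nn.CrossEntropyLoss()

        def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor,
                    pair_label: torch.Tensor) -> torch.Tensor:
            # feat_a, feat_b: (batch, dim) features; pair_label: (batch,) long tensor.
            logits = self.head(torch.cat([feat_a, feat_b], dim=-1))
            return self.ce(logits, pair_label)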
