Integrating Multi-Label Contrastive Learning With Dual Adversarial Graph Neural Networks for Cross-Modal Retrieval

With the growing amount of multimodal data, cross-modal retrieval has attracted more and more attention and become a hot research topic. To date, most of the existing techniques mainly convert multimodal data into a common representation space where similarities in semantics between samples can be easily measured across multiple modalities. However, these approaches may suffer from the following limitations: 1) They overcome the modality gap by introducing loss in the common representation space, which may not be sufficient to eliminate the heterogeneity of various modalities; 2) They treat labels as independent entities and ignore label relationships, which is not conducive to establishing semantic connections across multimodal data; 3) They ignore the non-binary values of label similarity in multi-label scenarios, which may lead to inefficient alignment of representation similarity with label similarity. To tackle these problems, in this article, we propose two models to learn discriminative and modality-invariant representations for cross-modal retrieval. First, the dual generative adversarial networks are built to project multimodal data into a common representation space. Second, to model label relation dependencies and develop inter-dependent classifiers, we employ multi-hop graph neural networks (consisting of Probabilistic GNN and Iterative GNN), where the layer aggregation mechanism is suggested for using propagation information of various hops. Third, we propose a novel soft multi-label contrastive loss for cross-modal retrieval, with the soft positive sampling probability, which can align the representation similarity and the label similarity. Additionally, to adapt to incomplete-modal learning, which can have wider applications, we propose a modal reconstruction mechanism to generate missing features. Extensive experiments on three widely used benchmark datasets, i.e., NUS-WIDE, MIRFlickr, and MS-COCO, show the superiority of our proposed method.

[1]  Shengsheng Qian,et al.  Adaptive Label-Aware Graph Convolutional Networks for Cross-Modal Retrieval , 2022, IEEE Transactions on Multimedia.

[2]  Peng Hu,et al.  Learning Cross-Modal Retrieval with Noisy Labels , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Changsheng Xu,et al.  Dual Adversarial Graph Neural Networks for Multi-label Cross-modal Retrieval , 2021, AAAI.

[4]  Linchao Zhu,et al.  T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Shengsheng Qian,et al.  HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  Jun Hu,et al.  Efficient Graph Deep Learning in TensorFlow with tf_geometric , 2021, ACM Multimedia.

[7]  Felix Mohr,et al.  AutoML for Multi-Label Classification: Overview and Empirical Evaluation , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Seong Joon Oh,et al.  Probabilistic Embeddings for Cross-Modal Retrieval , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Xiang Li,et al.  Instance-Level Heterogeneous Domain Adaptation for Limited-Labeled Sketch-to-Photo Retrieval , 2020, IEEE Transactions on Multimedia.

[10]  Zijian Wang,et al.  Deep Collaborative Discrete Hashing with Semantic-Invariant Structure , 2019, SIGIR.

[11]  Yuxin Peng,et al.  CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning , 2021 .

[12]  Qingming Huang,et al.  Learning Feature Representation and Partial Correlation for Multimodal Multi-Label Data , 2021, IEEE transactions on multimedia.

[13]  Xiu-Shen Wei,et al.  Disentangling, Embedding and Ranking Label Cues for Multi-Label Image Recognition , 2021, IEEE Transactions on Multimedia.

[14]  Lei Zhu,et al.  Incomplete Cross-modal Retrieval with Dual-Aligned Variational Autoencoders , 2020, ACM Multimedia.

[15]  Song Liu,et al.  Joint-modal Distribution-based Similarity Hashing for Large-scale Unsupervised Deep Cross-modal Retrieval , 2020, SIGIR.

[16]  Ce Liu,et al.  Supervised Contrastive Learning , 2020, NeurIPS.

[17]  Jinwen Ma,et al.  Multi-Label Classification with Label Graph Superimposing , 2019, AAAI.

[18]  Chengqi Zhang,et al.  Learning Graph Embedding With Adversarial Training Methods , 2019, IEEE Transactions on Cybernetics.

[19]  Jun Guo,et al.  Collective Affinity Learning for Partial Cross-Modal Hashing , 2020, IEEE Transactions on Image Processing.

[20]  Chao Zhang,et al.  Deep Joint-Semantics Reconstructing Hashing for Large-Scale Unsupervised Cross-Modal Retrieval , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Xianglong Liu,et al.  Graph Convolutional Network Hashing for Cross-Modal Retrieval , 2019, IJCAI.

[22]  Jing Jiang,et al.  Attributed Graph Clustering: A Deep Attentional Embedding Approach , 2019, IJCAI.

[23]  Xiu-Shen Wei,et al.  Multi-Label Image Recognition With Graph Convolutional Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Qingming Huang,et al.  Multi-modal semantic autoencoder for cross-modal retrieval , 2019, Neurocomputing.

[25]  Zhi-Hua Zhou,et al.  Fast Multi-Instance Multi-Label Learning , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Philip S. Yu,et al.  A Comprehensive Survey on Graph Neural Networks , 2019, IEEE Transactions on Neural Networks and Learning Systems.

[27]  Wei Hu,et al.  Generalized Graph Convolutional Networks for Skeleton-based Action Recognition , 2018, ArXiv.

[28]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[29]  Hao Ma,et al.  GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs , 2018, UAI.

[30]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[31]  Ruoyu Li,et al.  Adaptive Graph Convolutional Neural Networks , 2018, AAAI.

[32]  Pietro Liò,et al.  Graph Attention Networks , 2017, ICLR.

[33]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[34]  Yang Yang,et al.  Adversarial Cross-Modal Retrieval , 2017, ACM Multimedia.

[35]  Kien A. Hua,et al.  Linear Subspace Ranking Hashing for Cross-Modal Retrieval , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Jure Leskovec,et al.  Inductive Representation Learning on Large Graphs , 2017, NIPS.

[37]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[38]  Matthijs Douze,et al.  FastText.zip: Compressing text classification models , 2016, ArXiv.

[39]  Tieniu Tan,et al.  Joint Feature Selection and Subspace Learning for Cross-Modal Retrieval , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Changsheng Xu,et al.  Multi-modal Multi-view Topic-opinion Mining for Social Event Analysis , 2016, ACM Multimedia.

[41]  Xavier Bresson,et al.  Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering , 2016, NIPS.

[42]  Ivor W. Tsang,et al.  Co-Labeling for Multi-View Weakly Labeled Learning , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Changsheng Xu,et al.  Multi-Modal Event Topic Model for Social Event Analysis , 2016, IEEE Transactions on Multimedia.

[44]  Richard S. Zemel,et al.  Gated Graph Sequence Neural Networks , 2015, ICLR.

[45]  Chong-Wah Ngo,et al.  Learning Query and Image Similarities with Ranking Canonical Correlation Analysis , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[46]  Alán Aspuru-Guzik,et al.  Convolutional Networks on Graphs for Learning Molecular Fingerprints , 2015, NIPS.

[47]  Joan Bruna,et al.  Deep Convolutional Networks on Graph-Structured Data , 2015, ArXiv.

[48]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[49]  Victor S. Lempitsky,et al.  Unsupervised Domain Adaptation by Backpropagation , 2014, ICML.

[50]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[51]  Ruifan Li,et al.  Cross-modal Retrieval with Correspondence Autoencoder , 2014, ACM Multimedia.

[52]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[53]  Xiaohua Zhai,et al.  Learning Cross-Media Joint Representation With Sparse and Semisupervised Regularization , 2014, IEEE Transactions on Circuits and Systems for Video Technology.

[54]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[55]  Joan Bruna,et al.  Spectral Networks and Locally Connected Networks on Graphs , 2013, ICLR.

[56]  Qi Tian,et al.  Multimedia search reranking: A literature survey , 2014, CSUR.

[57]  Jeff A. Bilmes,et al.  Deep Canonical Correlation Analysis , 2013, ICML.

[58]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[59]  Andrew L. Maas Rectifier Nonlinearities Improve Neural Network Acoustic Models , 2013 .

[60]  Nitish Srivastava,et al.  Learning Representations for Multimodal Data with Deep Belief Nets , 2012 .

[61]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[62]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[63]  Bart Thomee,et al.  New trends and ideas in visual concept detection: the MIR flickr retrieval evaluation initiative , 2010, MIR '10.

[64]  Tat-Seng Chua,et al.  NUS-WIDE: a real-world web image database from National University of Singapore , 2009, CIVR '09.

[65]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[66]  Ah Chung Tsoi,et al.  The Graph Neural Network Model , 2009, IEEE Transactions on Neural Networks.

[67]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[68]  Shotaro Akaho,et al.  A kernel method for canonical correlation analysis , 2006, ArXiv.

[69]  Ishwar K. Sethi,et al.  Multimedia content processing through cross-modal association , 2003, MULTIMEDIA '03.

[70]  Geoffrey E. Hinton,et al.  A general framework for parallel distributed processing , 1986 .

[71]  H. Hotelling Relations Between Two Sets of Variates , 1936 .