Paper information - Integrating Multi-Label Contrastive Learning With Dual Adversarial Graph Neural Networks for Cross-Modal Retrieval

With the growing amount of multimodal data, cross-modal retrieval has attracted increasing attention and become an active research topic. To date, most existing techniques convert multimodal data into a common representation space where semantic similarities between samples can be easily measured across modalities. However, these approaches may suffer from the following limitations: 1) They overcome the modality gap by introducing losses in the common representation space, which may not be sufficient to eliminate the heterogeneity of the various modalities; 2) They treat labels as independent entities and ignore label relationships, which is not conducive to establishing semantic connections across multimodal data; 3) They ignore the non-binary values of label similarity in multi-label scenarios, which may lead to inefficient alignment of representation similarity with label similarity. To tackle these problems, in this article, we propose two models to learn discriminative and modality-invariant representations for cross-modal retrieval. First, dual generative adversarial networks are built to project multimodal data into a common representation space. Second, to model label dependencies and develop inter-dependent classifiers, we employ multi-hop graph neural networks (consisting of a Probabilistic GNN and an Iterative GNN), where a layer aggregation mechanism is proposed to exploit the information propagated over various hops. Third, we propose a novel soft multi-label contrastive loss for cross-modal retrieval, with a soft positive sampling probability, which aligns representation similarity with label similarity. Additionally, to adapt to incomplete-modal learning, which has wider applications, we propose a modal reconstruction mechanism to generate missing features. Extensive experiments on three widely used benchmark datasets, i.e., NUS-WIDE, MIRFlickr, and MS-COCO, show the superiority of our proposed method.
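
The idea of aligning representation similarity with non-binary label similarity can be illustrated with a minimal sketch. The abstract does not state the loss formula, so everything below is an assumption made for illustration and not the authors' formulation: cosine similarity between multi-hot label vectors is used as the soft positive weight, the loss is a softmax cross-entropy over cosine similarities of embeddings, and the names `cosine`, `soft_contrastive_loss`, and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def soft_contrastive_loss(img_emb, txt_emb, labels, temperature=0.1):
    """Illustrative image-to-text loss with soft (non-binary) positives.

    Assumption: labels are multi-hot vectors with non-negative entries, so
    their cosine similarity lies in [0, 1] and can act as a soft weight.
    """
    n = len(img_emb)
    total = 0.0
    for i in range(n):
        # Soft positive weights: label similarity of sample i to every sample j.
        weights = [cosine(labels[i], labels[j]) for j in range(n)]
        weight_sum = sum(weights)
        if weight_sum == 0.0:
            continue  # a sample without labels contributes nothing
        targets = [w / weight_sum for w in weights]
        # Representation similarity between image i and every text j.
        logits = [cosine(img_emb[i], txt_emb[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp for the softmax normaliser.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy between the soft targets and the softmax over texts.
        total -= sum(t * (x - log_norm) for t, x in zip(targets, logits))
    return total / n


# Toy data: samples 0 and 1 share one label, sample 2 shares none.
labels = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]
images = [[1.0, 0.2], [1.0, -0.2], [-1.0, 0.0]]
texts = [[1.0, 0.2], [1.0, -0.2], [-1.0, 0.0]]
shuffled_texts = [texts[2], texts[0], texts[1]]

print("label similarity(0, 1):", cosine(labels[0], labels[1]))
print("loss, aligned pairs:   ", soft_contrastive_loss(images, texts, labels))
print("loss, shuffled pairs:  ", soft_contrastive_loss(images, shuffled_texts, labels))
```

In this toy setup the label similarity between samples 0 and 1 is strictly between 0 and 1, which is the non-binary case the abstract refers to, and the loss is lower when embeddings of samples with overlapping labels lie close together than when the texts are shuffled.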

Shengsheng Qian | Quan Fang | Changsheng Xu | Dizhan Xue
