Multi-Graph Based Hierarchical Semantic Fusion for Cross-Modal Representation

The central challenge of cross-modal retrieval is to achieve semantic alignment efficiently while reducing the heterogeneity gap between modalities. However, existing approaches ignore the multi-grained semantic knowledge that can be learned from different modalities. To this end, this paper proposes a novel end-to-end cross-modal representation method, termed Multi-Graph based Hierarchical Semantic Fusion (MG-HSF). The method integrates multi-graph hierarchical semantic fusion with cross-modal adversarial learning: it captures both fine-grained and coarse-grained semantic knowledge from cross-modal samples and generates modality-invariant representations in a common subspace. Extensive experiments on three benchmarks show that our method outperforms the state of the art.

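The abstract describes MG-HSF only at a high level, so the following is a minimal sketch of the two ingredients it names: graph-based semantic fusion over a cross-modal sample graph, and adversarial learning toward modality-invariant features. Everything here (layer sizes, the single GCN layer, the gradient-reversal discriminator, and names such as `MGHSFSketch` and `a_hat`) is an illustrative assumption about one plausible instantiation, not the authors' exact architecture.

```python
# Minimal PyTorch sketch, assuming: (1) a graph convolution fuses semantics
# across image and text samples, and (2) a modality discriminator trained
# through gradient reversal pushes both modalities into a common subspace.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(A_hat @ H @ W)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, h, a_hat):
        # a_hat: (N, N) normalized adjacency over all graph nodes
        return torch.relu(a_hat @ self.fc(h))


class MGHSFSketch(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=300, common_dim=512):
        super().__init__()
        self.img_enc = nn.Linear(img_dim, common_dim)  # image branch -> common subspace
        self.txt_enc = nn.Linear(txt_dim, common_dim)  # text branch  -> common subspace
        self.gcn = GCNLayer(common_dim, common_dim)    # semantic fusion over the sample graph
        self.disc = nn.Sequential(                     # modality discriminator (image vs. text)
            nn.Linear(common_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, img_feat, txt_feat, a_hat):
        z_img = self.img_enc(img_feat)                 # (B, common_dim)
        z_txt = self.txt_enc(txt_feat)                 # (B, common_dim)
        z = torch.cat([z_img, z_txt], dim=0)           # 2B nodes of the cross-modal graph
        z = self.gcn(z, a_hat)                         # propagate semantics between samples
        # Gradient reversal: the discriminator learns to tell modalities
        # apart while the encoders learn to make them indistinguishable.
        modality_logits = self.disc(GradReverse.apply(z))
        return z, modality_logits
```

In a training loop one would combine a cross-entropy loss on `modality_logits` (against image/text modality labels) with a supervised alignment loss on `z`; the reversed gradients then drive both encoders toward modality-invariant representations. The adjacency `a_hat` could be built from shared class labels or k-nearest neighbors and symmetrically normalized, again as an assumption about one reasonable setup.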