Adversarial Learning-Based Semantic Correlation Representation for Cross-Modal Retrieval

Cross-modal retrieval has become a popular research topic in recent years. Many existing works focus on correlation learning to generate a common subspace in which cross-modal correlation can be measured, while others use adversarial learning techniques to reduce the heterogeneity of multimodal data. However, very few works combine correlation learning and adversarial learning to bridge the intermodal semantic gap and diminish cross-modal heterogeneity. This article proposes a novel cross-modal retrieval method, named Adversarial Learning based Semantic COrrelation Representation (ALSCOR), an end-to-end framework that integrates cross-modal representation learning, correlation learning, and adversarial learning. A canonical correlation analysis (CCA) model, combined with VisNet and TxtNet, is proposed to capture cross-modal nonlinear correlation. In addition, an intramodal classifier and a modality classifier are used to learn intramodal discrimination and to minimize intermodal heterogeneity. Comprehensive experiments are conducted on three benchmark datasets. The results demonstrate that the proposed ALSCOR outperforms state-of-the-art methods.
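The correlation-learning component described above builds on canonical correlation analysis. As a minimal sketch of the classical linear case (the paper's actual model is nonlinear and trained end-to-end with VisNet and TxtNet; the function name, regularizer, and toy data below are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def cca_correlations(X, Y, reg=1e-6):
    """Classical linear CCA: canonical correlations between two views.

    X: (n, p) features from one modality (e.g. image descriptors)
    Y: (n, q) features from the other modality (e.g. text descriptors)
    reg: small ridge term keeping the covariance matrices invertible
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition (S symmetric PD)
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Singular values of the whitened cross-covariance are the
    # canonical correlations, sorted in descending order.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False)

# Toy example: two "modalities" that share one latent signal z
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)),
               rng.normal(size=(500, 2))])
Y = np.hstack([z + 0.1 * rng.normal(size=(500, 1)),
               rng.normal(size=(500, 2))])
corrs = cca_correlations(X, Y)
print(corrs)  # first value is high (shared signal), the rest near zero
```

In ALSCOR this linear objective is replaced by a learned nonlinear mapping: the two deep networks project each modality into the common subspace, and the CCA-style criterion encourages the projections of matching image-text pairs to be maximally correlated there.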
