Cross-Modal Generation and Pair Correlation Alignment Hashing

Cross-modal hashing is an effective approach to cross-modal retrieval because of its low storage cost and high efficiency. However, most existing methods mainly use pre-trained networks to extract modality-specific features, ignoring position information and lacking information interaction between modalities. To address these problems, we propose a novel approach, named cross-modal generation and pair correlation alignment hashing (CMGCAH), which introduces a transformer to exploit position information and utilizes cross-modal generative adversarial networks (GANs) to boost cross-modal information interaction. Concretely, a cross-modal interaction network based on a conditional generative adversarial network, together with pair correlation alignment networks, is proposed to generate cross-modal common representations. In addition, a transformer-based feature extraction network (TFEN) is designed to exploit position information, which can be propagated to the text modality and enforce semantic consistency in the common representations. Experiments on widely used text-image datasets show that the proposed method achieves performance competitive with many existing methods.
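
To make the described architecture concrete, below is a minimal PyTorch sketch of the kind of pipeline the abstract outlines: a transformer-based feature extractor per modality (standing in for TFEN), a conditional-GAN-style generator that maps one modality's features into a common space conditioned on the paired sample from the other modality, a modality discriminator, and a toy pair correlation alignment loss. All module names, dimensions, and loss forms here are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TransformerFeatureExtractor(nn.Module):
    """Transformer encoder over a token sequence; learned positional
    embeddings supply the position information the abstract emphasizes."""
    def __init__(self, dim=256, max_len=50, depth=2, heads=4):
        super().__init__()
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens):                 # tokens: (B, L, dim)
        h = self.encoder(tokens + self.pos_emb[:, :tokens.size(1)])
        return h.mean(dim=1)                   # (B, dim) pooled feature

class ConditionalGenerator(nn.Module):
    """Maps source-modality features into the common space, conditioned
    on the paired target-modality features (conditional-GAN style)."""
    def __init__(self, dim=256, code_len=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2, dim), nn.ReLU(),
            nn.Linear(dim, code_len), nn.Tanh())  # Tanh: relaxed hash codes

    def forward(self, src_feat, cond_feat):
        return self.net(torch.cat([src_feat, cond_feat], dim=1))

class ModalityDiscriminator(nn.Module):
    """Predicts which modality a common-space code came from; the
    generator is trained adversarially to fool it, encouraging
    modality-invariant common representations."""
    def __init__(self, code_len=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(code_len, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, code):
        return self.net(code)                  # raw logit: image vs. text

def pair_correlation_alignment(img_code, txt_code):
    """Toy alignment loss: match the pairwise similarity structure of the
    two modalities' codes so paired correlations agree across modalities."""
    sim_img = img_code @ img_code.t()
    sim_txt = txt_code @ txt_code.t()
    return ((sim_img - sim_txt) ** 2).mean()

# Usage sketch with random stand-in inputs:
img_tokens = torch.randn(8, 49, 256)   # e.g. 7x7 grid of patch features
txt_tokens = torch.randn(8, 30, 256)   # e.g. word embeddings
img_enc, txt_enc = TransformerFeatureExtractor(), TransformerFeatureExtractor()
gen, disc = ConditionalGenerator(), ModalityDiscriminator()

img_feat, txt_feat = img_enc(img_tokens), txt_enc(txt_tokens)
img_code = gen(img_feat, txt_feat)     # image code conditioned on paired text
txt_code = gen(txt_feat, img_feat)     # text code conditioned on paired image
align_loss = pair_correlation_alignment(img_code, txt_code)
adv_logits = disc(torch.cat([img_code, txt_code], dim=0))
```

At retrieval time the relaxed Tanh outputs would be binarized into hash codes (e.g., with torch.sign); the adversarial and alignment terms above merely stand in for the paper's full training objective, which this sketch does not reproduce.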
