HCMSL: Hybrid Cross-modal Similarity Learning for Cross-modal Retrieval

The purpose of cross-modal retrieval is to learn the relationships between samples of different modalities, so that a query from one modality can retrieve semantically similar samples from another. Because data from different modalities exhibit heterogeneous low-level features but semantically related high-level features, the central problem of cross-modal retrieval is how to measure similarity across modalities. In this article, we present a novel cross-modal retrieval method named the Hybrid Cross-Modal Similarity Learning model (HCMSL for short). It aims to capture sufficient semantic information from both labeled and unlabeled cross-modal pairs, as well as from intra-modal pairs sharing the same class label. Specifically, coupled deep fully connected networks map cross-modal feature representations into a common subspace, and a weight-sharing strategy between the two network branches reduces cross-modal heterogeneity. Furthermore, two Siamese CNN models learn intra-modal similarity from samples of the same modality. Comprehensive experiments on real-world datasets demonstrate that the proposed technique achieves substantial improvements over state-of-the-art cross-modal retrieval techniques.
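The coupled-branch design described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimensions, single hidden layer, and random weights are all hypothetical, and cosine similarity in the common subspace stands in for whatever similarity the full model learns. The key idea shown is that each modality keeps its own input layer while the final projection is shared between branches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: raw image/text features and the common subspace.
IMG_DIM, TXT_DIM, HID_DIM, COMMON_DIM = 8, 6, 5, 4

# Each modality gets its own first (modality-specific) layer...
W_img = rng.standard_normal((IMG_DIM, HID_DIM)) * 0.1
W_txt = rng.standard_normal((TXT_DIM, HID_DIM)) * 0.1
# ...but the final projection is shared between the two branches,
# mirroring the weight-sharing strategy that reduces heterogeneity.
W_shared = rng.standard_normal((HID_DIM, COMMON_DIM)) * 0.1

def embed(x, W_branch):
    """Map a raw feature vector into the common subspace."""
    h = np.maximum(x @ W_branch, 0.0)      # modality-specific layer + ReLU
    z = h @ W_shared                        # shared projection layer
    return z / (np.linalg.norm(z) + 1e-8)  # L2-normalize for cosine similarity

img = rng.standard_normal(IMG_DIM)
txt = rng.standard_normal(TXT_DIM)

# With both embeddings in one subspace, cross-modal similarity
# reduces to a dot product (cosine similarity after normalization).
sim = float(embed(img, W_img) @ embed(txt, W_txt))
print(sim)
```

In training, the shared weights would be updated from both modalities' losses, which is what forces the two branches toward a genuinely common representation.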
