Unsupervised Contrastive Cross-Modal Hashing

In this paper, we study how to make unsupervised cross-modal hashing (CMH) benefit from contrastive learning (CL) by overcoming two challenges. To be exact, i) to address the performance degradation issue caused by binary optimization for hashing, we propose a novel momentum optimizer that performs hashing operation learnable in CL, thus making on-the-shelf deep cross-modal hashing possible. In other words, our method does not involve binary-continuous relaxation like most existing methods, thus enjoying better retrieval performance; ii) to alleviate the influence brought by false-negative pairs (FNPs), we propose a Cross-modal Ranking Learning loss (CRL) which utilizes the discrimination from all instead of only the hard negative pairs, where FNP refers to the within-class pairs that were wrongly treated as negative pairs. Thanks to such a global strategy, CRL endows our method with better performance because CRL will not overuse the FNPs while ignoring the true-negative pairs. To the best of our knowledge, the proposed method could be one of the first successful contrastive hashing methods. To demonstrate the effectiveness of the proposed method, we carry out experiments on five widely-used datasets compared with 13 state-of-the-art methods. The code is available at https://github.com/penghu-cs/UCCH.

[1]  Yong Hu,et al.  Deep Cross-Modal Hashing With Hashing Functions and Unified Hash Codes Jointly Learning , 2019, IEEE Transactions on Knowledge and Data Engineering.

[2]  Nam Ik Cho,et al.  Self-supervised Product Quantization for Deep Unsupervised Image Retrieval , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[3]  Mingyong Li,et al.  Unsupervised Deep Cross-Modal Hashing by Knowledge Distillation for Large-scale Cross-modal Retrieval , 2021, ICMR.

[4]  Dacheng Tao,et al.  Deep Graph-neighbor Coherence Preserving Network for Unsupervised Cross-modal Hashing , 2021, AAAI.

[5]  Changyou Chen,et al.  Unsupervised Hashing with Contrastive Information Bottleneck , 2021, IJCAI.

[6]  Shuli Cheng,et al.  Deep Semantic-Preserving Reconstruction Hashing for Unsupervised Cross-Modal Retrieval , 2020, Entropy.

[7]  Erwin M. Bakker,et al.  Self-constraining and attention-based hashing network for bit-scalable cross-modal retrieval , 2020, Neurocomputing.

[8]  Song Liu,et al.  Joint-modal Distribution-based Similarity Hashing for Large-scale Unsupervised Deep Cross-modal Retrieval , 2020, SIGIR.

[9]  Jie Lin,et al.  Semi-Supervised Multi-Modal Learning with Balanced Spectral Decomposition , 2020, AAAI.

[10]  Yee Leung,et al.  Robust Multiview Subspace Learning With Nonindependently and Nonidentically Distributed Complex Noise , 2020, IEEE Transactions on Neural Networks and Learning Systems.

[11]  Yang Yang,et al.  Graph Convolutional Network Hashing , 2020, IEEE Transactions on Cybernetics.

[12]  Dacheng Tao,et al.  Adversarial Examples for Hamming Space Search , 2020, IEEE Transactions on Cybernetics.

[13]  Kaiming He,et al.  Improved Baselines with Momentum Contrastive Learning , 2020, ArXiv.

[14]  Ross B. Girshick,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Phillip Isola,et al.  Contrastive Multiview Coding , 2019, ECCV.

[16]  Rui Zhang,et al.  Contrastive Self-Supervised Hashing With Dual Pseudo Agreement , 2020, IEEE Access.

[17]  Deyu Meng,et al.  Self-paced Multi-view Co-training , 2020, J. Mach. Learn. Res..

[18]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[19]  Huaxiang Zhang,et al.  Flexible Online Multi-modal Hashing for Large-scale Multimedia Retrieval , 2019, ACM Multimedia.

[20]  Dezhong Peng,et al.  Separated Variational Hashing Networks for Cross-Modal Retrieval , 2019, ACM Multimedia.

[21]  Chao Zhang,et al.  Deep Joint-Semantics Reconstructing Hashing for Large-Scale Unsupervised Cross-Modal Retrieval , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Dezhong Peng,et al.  Scalable Deep Multimodal Learning for Cross-Modal Retrieval , 2019, SIGIR.

[23]  Heng Tao Shen,et al.  Deep Recurrent Quantization for Generating Sequential Binary Codes , 2019, IJCAI.

[24]  Jiancheng Lv,et al.  COMIC: Multi-view Clustering Without Parameter Selection , 2019, ICML.

[25]  Jun Wang,et al.  Ranking-based Deep Cross-modal Hashing , 2019, AAAI.

[26]  Chao Li,et al.  Coupled CycleGAN: Unsupervised Hashing Network for Cross-Modal Retrieval , 2019, AAAI.

[27]  Wu-Jun Li,et al.  Discrete Latent Factor Model for Cross-Modal Hashing , 2017, IEEE Transactions on Image Processing.

[28]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[29]  Stella X. Yu,et al.  Unsupervised Feature Learning via Non-parametric Instance Discrimination , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Xinbo Gao,et al.  Triplet-Based Deep Hashing Network for Cross-Modal Retrieval , 2018, IEEE Transactions on Image Processing.

[31]  Wei Liu,et al.  Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Xi Chen,et al.  Stacked Cross Attention for Image-Text Matching , 2018, ECCV.

[33]  Yuxin Peng,et al.  Unsupervised Generative Adversarial Cross-modal Hashing , 2017, AAAI.

[34]  David J. Fleet,et al.  VSE++: Improving Visual-Semantic Embeddings with Hard Negatives , 2017, BMVC.

[35]  Jungong Han,et al.  Cross-View Retrieval via Probability-Based Semantics-Preserving Hashing , 2017, IEEE Transactions on Cybernetics.

[36]  Kien A. Hua,et al.  Linear Subspace Ranking Hashing for Cross-Modal Retrieval , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Wei Liu,et al.  Classification by Retrieval: Binarizing Data and Classifiers , 2017, SIGIR.

[38]  Rongrong Ji,et al.  Cross-Modality Binary Code Learning via Fusion Similarity Hashing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Ling Shao,et al.  Deep Binaries: Encoding Semantic-Rich Cues for Efficient Textual-Visual Cross Retrieval , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[40]  Devraj Mandal,et al.  Generalized Semantic Preserving Hashing for N-Label Cross-Modal Retrieval , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Xuelong Li,et al.  Learning Discriminative Binary Codes for Large-scale Cross-modal Retrieval , 2017, IEEE Transactions on Image Processing.

[42]  Shiming Xiang,et al.  Cross-Modal Hashing via Rank-Order Preserving , 2017, IEEE Transactions on Multimedia.

[43]  Jung-Woo Ha,et al.  Dual Attention Networks for Multimodal Reasoning and Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Wu-Jun Li,et al.  Deep Cross-Modal Hashing , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Yue Gao,et al.  Large-Scale Cross-Modality Search via Collective Matrix Factorization Hashing , 2016, IEEE Transactions on Image Processing.

[46]  Timothy Baldwin,et al.  An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation , 2016, Rep4NLP@ACL.

[47]  Wei-Yun Yau,et al.  Deep Subspace Clustering with Sparsity Prior , 2016, IJCAI.

[48]  Jiwen Lu,et al.  Learning Compact Binary Face Descriptor for Face Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[49]  Jianmin Wang,et al.  Semantics-preserving hashing for cross-view retrieval , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Wei Liu,et al.  Supervised Discrete Hashing , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[52]  Guiguang Ding,et al.  Latent semantic sparse hashing for cross-modal similarity search , 2014, SIGIR.

[53]  Guiguang Ding,et al.  Collective Matrix Factorization Hashing for Multimodal Data , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[55]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[56]  Peter Young,et al.  From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.

[57]  Zi Huang,et al.  Inter-media hashing for large-scale retrieval from heterogeneous data sources , 2013, SIGMOD '13.

[58]  Raghavendra Udupa,et al.  Learning Hash Functions for Cross-View Similarity Search , 2011, IJCAI.

[59]  Roger Levy,et al.  A new approach to cross-modal multimedia retrieval , 2010, ACM Multimedia.

[60]  Hugo Jair Escalante,et al.  The segmented and annotated IAPR TC-12 benchmark , 2010, Comput. Vis. Image Underst..

[61]  Mark J. Huiskes,et al.  The MIR flickr retrieval evaluation , 2008, MIR '08.

[62]  Yann LeCun,et al.  Dimensionality Reduction by Learning an Invariant Mapping , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).