Self-constraining and attention-based hashing network for bit-scalable cross-modal retrieval

Abstract Recently deep cross-modal hashing (CMH) have received increased attention in multimedia information retrieval, as it is able to combine the benefit from the low storage cost and search efficiency of hashing, and the strong capabilities of feature abstraction by deep neural networks. CMH can effectively integrates hash representation learning and hash function optimization into an end-to-end framework. However, most of existing deep cross-modal hashing methods use a one-size-fits-all high level representation resulting in the loss of the spatial information of data. Also, previous methods mostly generated fixed length hashing codes. Here, the significance level of different bits are equally weighted thereby restricting their practical flexibility. To address these issues, we propose a self-constraining and attention-based hashing network (SCAHN) for bit-scalable cross-modal hashing. SCAHN integrates the label constraints from early and late-stages as well as their fused features into the hash representation and hash function learning. Moreover, as the fusion of early and late-stages features is based on an attention mechanism, each bit of the hashing codes can be unequally weighted so that the code lengths can be manipulated by ranking the significance of each bit without extra hash-model training. Extensive experiments conducted on four benchmark datasets demonstrate that our proposed SCAHN outperforms the current state-of-the-art CMH methods. Moreover, it is also shown that the generated bit-scalable hashing codes well-preserve the discriminative power with varying code lengths and obtain competitive results comparing to the state-of-the-art.

[1]  Trevor Darrell,et al.  Learning cross-modality similarity for multinomial data , 2011, 2011 International Conference on Computer Vision.

[2]  Bo Li,et al.  HashGAN: Attention-aware Deep Adversarial Hashing for Cross Modal Retrieval , 2017, ArXiv.

[3]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2015, IEEE Trans. Pattern Anal. Mach. Intell..

[4]  Laith Mohammad Abualigah,et al.  APPLYING GENETIC ALGORITHMS TO INFORMATION RETRIEVAL USING VECTOR SPACE MODEL , 2015 .

[5]  Hugo Larochelle,et al.  Topic Modeling of Multimodal Data: An Autoregressive Approach , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Pengfei Zhang,et al.  Semi-Relaxation Supervised Hashing for Cross-Modal Retrieval , 2017, ACM Multimedia.

[7]  Rongrong Ji,et al.  Supervised hashing with kernels , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Alexander J. Smola,et al.  Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Dapeng Tao,et al.  Manifold regularized kernel logistic regression for web image annotation , 2013, Neurocomputing.

[10]  Yueting Zhuang,et al.  Supervised Coupled Dictionary Learning with Group Structures for Multi-modal Retrieval , 2013, AAAI.

[11]  Guiguang Ding,et al.  Latent semantic sparse hashing for cross-modal similarity search , 2014, SIGIR.

[12]  Erwin M. Bakker,et al.  Deep binary codes for large scale image retrieval , 2017, Neurocomputing.

[13]  Laith Mohammad Abualigah,et al.  A combination of objective functions and hybrid Krill herd algorithm for text document clustering analysis , 2018, Eng. Appl. Artif. Intell..

[14]  Devraj Mandal,et al.  Generalized Semantic Preserving Hashing for N-Label Cross-Modal Retrieval , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Antonio Torralba,et al.  Spectral Hashing , 2008, NIPS.

[16]  Xinbo Gao,et al.  Triplet-Based Deep Hashing Network for Cross-Modal Retrieval , 2018, IEEE Transactions on Image Processing.

[17]  Bin Liu,et al.  Cross-Modal Hamming Hashing , 2018, ECCV.

[18]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[19]  Jian Wang,et al.  Image-Text Cross-Modal Retrieval via Modality-Specific Feature Learning , 2015, ICMR.

[20]  Guiguang Ding,et al.  Collective Matrix Factorization Hashing for Multimodal Data , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Gang Hua,et al.  A convolutional neural network cascade for face detection , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[24]  Wu-Jun Li,et al.  Deep Cross-Modal Hashing , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Xiao-Yuan Jing,et al.  Intra-View and Inter-View Supervised Correlation Analysis for Multi-View Feature Learning , 2014, AAAI.

[26]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[27]  Wu-Jun Li,et al.  Isotropic Hashing , 2012, NIPS.

[28]  Wei Liu,et al.  Discrete Graph Hashing , 2014, NIPS.

[29]  Michael Isard,et al.  A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics , 2012, International Journal of Computer Vision.

[30]  Zi Huang,et al.  Inter-media hashing for large-scale retrieval from heterogeneous data sources , 2013, SIGMOD '13.

[31]  Qionghai Dai,et al.  Cross-Modality Bridging and Knowledge Transferring for Image Understanding , 2019, IEEE Transactions on Multimedia.

[32]  Wei Liu,et al.  Self-Supervised Adversarial Hashing Networks for Cross-Modal Retrieval , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[34]  David J. Fleet,et al.  Minimal Loss Hashing for Compact Binary Codes , 2011, ICML.

[35]  Tieniu Tan,et al.  Joint Feature Selection and Subspace Learning for Cross-Modal Retrieval , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Laith Mohammad Abualigah,et al.  Hybrid clustering analysis using improved krill herd algorithm , 2018, Applied Intelligence.

[37]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Xiaoxiao Li,et al.  Semantic Image Segmentation via Deep Parsing Network , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[39]  Hugo Jair Escalante,et al.  The segmented and annotated IAPR TC-12 benchmark , 2010, Comput. Vis. Image Underst..

[40]  Michael S. Lew,et al.  Deep learning for visual understanding: A review , 2016, Neurocomputing.

[41]  Lei Zhang,et al.  Bit-Scalable Deep Hashing With Regularized Similarity Learning for Image Retrieval and Person Re-Identification , 2015, IEEE Transactions on Image Processing.

[42]  Yao Zhao,et al.  Cross-Modal Retrieval With CNN Visual Features: A New Baseline , 2017, IEEE Transactions on Cybernetics.

[43]  Laith Mohammad Abualigah,et al.  Unsupervised text feature selection technique based on hybrid particle swarm optimization algorithm with genetic operators for the text clustering , 2017, The Journal of Supercomputing.

[44]  Beng Chin Ooi,et al.  Effective deep learning-based multi-modal retrieval , 2015, The VLDB Journal.

[45]  Armand Joulin,et al.  Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , 2014, NIPS.

[46]  Zhiming Luo,et al.  Invariance Matters: Exemplar Memory for Domain Adaptive Person Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Wei Liu,et al.  Pairwise Relationship Guided Deep Hashing for Cross-Modal Retrieval , 2017, AAAI.

[48]  Yongdong Zhang,et al.  STAT: Spatial-Temporal Attention Mechanism for Video Captioning , 2020, IEEE Transactions on Multimedia.

[49]  Yueting Zhuang,et al.  Multi-modal Mutual Topic Reinforce Modeling for Cross-media Retrieval , 2014, ACM Multimedia.

[50]  Dongqing Zhang,et al.  Large-Scale Supervised Multimodal Hashing with Semantic Correlation Maximization , 2014, AAAI.

[51]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[52]  Yuxin Peng,et al.  Unsupervised Generative Adversarial Cross-modal Hashing , 2017, AAAI.

[53]  Xin-Shun Xu Dictionary Learning Based Hashing for Cross-Modal Retrieval , 2016, ACM Multimedia.

[54]  Yueting Zhuang,et al.  Deep Compositional Cross-modal Learning to Rank via Local-Global Alignment , 2015, ACM Multimedia.