Dual-supervised attention network for deep cross-modal hashing

Abstract Cross-modal hashing has received intensive attention due to its low computation and storage efficiency in cross-modal retrieval task. Most previous cross-modal hashing methods mainly focus on extracting correlated binary codes from the pairwise label, but largely ignore the semantic categories of cross-modal data. On the other hand, human perception exploits category information to connect cross-modal samples. Inspired by this fact, we propose to embed category information into hash codes. More specifically, we introduce semantic prediction loss into our framework to enhance hash codes with category supervision. In addition, there always exists a large gap between features from different modalities (e.g. text and images), leading cross-modal hashing to link irrelevant features for retrieval task. To address this issue, this paper proposes Dual-Supervised Attention Network for Deep Hashing (DSADH) to learn the cross-modal relationship via an elaborately-designed attention mechanism. Our cross-modal network applies cross-modal attention block to efficiently encode rich and relevant features to learn compact hash codes. Extensive experiments on three challenging benchmarks demonstrate that our proposed method significantly improves the retrieval results.

[1]  Andrea Vedaldi,et al.  MatConvNet: Convolutional Neural Networks for MATLAB , 2014, ACM Multimedia.

[2]  Xiaogang Wang,et al.  Residual Attention Network for Image Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Wu-Jun Li,et al.  Feature Learning Based Deep Supervised Hashing with Pairwise Labels , 2015, IJCAI.

[4]  Kristen Grauman,et al.  Kernelized locality-sensitive hashing for scalable image search , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[5]  Guiguang Ding,et al.  Collective Matrix Factorization Hashing for Multimodal Data , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Wei Liu,et al.  Pairwise Relationship Guided Deep Hashing for Cross-Modal Retrieval , 2017, AAAI.

[8]  Jonghyun Choi,et al.  Predictable Dual-View Hashing , 2013, ICML.

[9]  Wu-Jun Li,et al.  Deep Cross-Modal Hashing , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Quan Wang,et al.  Robust and Flexible Discrete Hashing for Cross-Modal Similarity Search , 2018, IEEE Transactions on Circuits and Systems for Video Technology.

[11]  Tao Mei,et al.  Deep Semantic-Preserving and Ranking-Based Hashing for Image Retrieval , 2016, IJCAI.

[12]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[13]  Mark J. Huiskes,et al.  The MIR flickr retrieval evaluation , 2008, MIR '08.

[14]  Paul Clough,et al.  The IAPR TC-12 Benchmark: A New Evaluation Resource for Visual Information Systems , 2006 .

[15]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[16]  Yi Yang,et al.  Attention to Scale: Scale-Aware Semantic Image Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Jianmin Wang,et al.  Semantics-preserving hashing for cross-view retrieval , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[19]  Jiwen Lu,et al.  Deep Variational and Structural Hashing , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Wei Xu,et al.  Look and Think Twice: Capturing Top-Down Visual Attention with Feedback Convolutional Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  Yi Zhen,et al.  A probabilistic model for multimodal hash function learning , 2012, KDD.

[22]  Dongqing Zhang,et al.  Large-Scale Supervised Multimodal Hashing with Semantic Correlation Maximization , 2014, AAAI.

[23]  David J. Fleet,et al.  Fast search in Hamming space with multi-index hashing , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[25]  Kilian Q. Weinberger,et al.  Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Shuicheng Yan,et al.  Non-Metric Locality-Sensitive Hashing , 2010, AAAI.

[27]  Svetlana Lazebnik,et al.  Iterative quantization: A procrustean approach to learning binary codes , 2011, CVPR 2011.

[28]  Richard Socher,et al.  Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Xinbo Gao,et al.  Multimodal Discriminative Binary Embedding for Large-Scale Cross-Modal Retrieval , 2016, IEEE Transactions on Image Processing.

[30]  Fumin Shen,et al.  Inductive Hashing on Manifolds , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Ling Shao,et al.  Deep Binaries: Encoding Semantic-Rich Cues for Efficient Textual-Visual Cross Retrieval , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[32]  Tat-Seng Chua,et al.  NUS-WIDE: a real-world web image database from National University of Singapore , 2009, CIVR '09.

[33]  Yuxin Peng,et al.  The application of two-level attention models in deep convolutional neural network for fine-grained image classification , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Yi Shi,et al.  Deep Supervised Hashing with Triplet Labels , 2016, ACCV.

[35]  Gang Hua,et al.  Supervised Matrix Factorization for Cross-Modality Hashing , 2016, IJCAI.

[36]  Xiao Zhang,et al.  Sparse spectral hashing , 2012, Pattern Recognit. Lett..

[37]  Xinbo Gao,et al.  Semantic Topic Multimodal Hashing for Cross-Media Retrieval , 2015, IJCAI.

[38]  Qi Tian,et al.  Super-Bit Locality-Sensitive Hashing , 2012, NIPS.

[39]  David J. Fleet,et al.  Minimal Loss Hashing for Compact Binary Codes , 2011, ICML.

[40]  Wei Liu,et al.  Supervised Discrete Hashing , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Yuhong Guo,et al.  Zero-Shot Classification with Discriminative Semantic Representation Learning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Hiroyuki Arai,et al.  Alternating Co-Quantization for Cross-Modal Hashing , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[43]  Jiwen Lu,et al.  Deep Hashing for Scalable Image Search , 2017, IEEE Transactions on Image Processing.

[44]  Alexander J. Smola,et al.  Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).