Supervised Hierarchical Cross-Modal Hashing

Recently, due to the unprecedented growth of multimedia data, cross-modal hashing has gained increasing attention for efficient cross-media retrieval. Existing cross-modal hashing methods typically treat the labels of an instance independently and overlook the correlations among them. Yet in many real-world scenarios, such as the online fashion domain, instances (items) are labeled with a set of categories that are related through a hierarchy. In this paper, we propose a new end-to-end solution for supervised cross-modal hashing, named HiCHNet, which explicitly exploits the hierarchical labels of instances. In particular, using the pre-established label hierarchy, we characterize each modality of an instance with a set of layer-wise hash representations. In essence, the hash codes are encouraged not only to preserve the layer-wise semantic similarities encoded by the label hierarchy, but also to retain hierarchical discriminative capabilities. Because benchmark datasets are lacking, we adapt the existing fashion-domain dataset FashionVC and additionally create a dataset from the online fashion platform Ssense, consisting of 15,696 image-text pairs labeled with 32 hierarchical categories. Extensive experiments on these two real-world datasets demonstrate the superiority of our model over state-of-the-art methods.
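
To make the idea of layer-wise hash representations concrete, the following is a minimal illustrative sketch, not HiCHNet itself. The abstract does not specify the network or the loss, so everything here is an assumption: the two-layer label hierarchy, the hand-written (not learned) codes, and the hypothetical helpers `layer_similarity`, `hamming`, `concat`, and `rank`. It only shows how per-layer similarity can be read off hierarchical labels, and how concatenated layer-wise codes could be compared across modalities by Hamming distance.

```python
from typing import Dict, List, Sequence

# Assumption: each instance carries one category per hierarchy layer,
# e.g. ("clothing", "dress") for a two-layer hierarchy.
def layer_similarity(labels_a: Sequence[str], labels_b: Sequence[str]) -> List[int]:
    """Return, per layer, 1 if the two instances share the category, else 0."""
    return [1 if a == b else 0 for a, b in zip(labels_a, labels_b)]

def hamming(code_a: Sequence[int], code_b: Sequence[int]) -> int:
    """Number of positions where two equal-length codes differ."""
    return sum(x != y for x, y in zip(code_a, code_b))

def concat(layer_codes: Sequence[Sequence[int]]) -> List[int]:
    """Concatenate the per-layer codes of one instance into a single code."""
    return [bit for code in layer_codes for bit in code]

def rank(query: Sequence[Sequence[int]],
         database: Dict[str, Sequence[Sequence[int]]]) -> List[str]:
    """Sort database items by Hamming distance to the query (closest first)."""
    q = concat(query)
    return sorted(database, key=lambda name: hamming(q, concat(database[name])))

# Toy, hand-written codes (hypothetical; a real model would learn them).
# Each instance has one 4-bit code per layer: [coarse layer, fine layer].
image_query = [[1, 1, -1, -1], [1, -1, 1, -1]]           # image of a dress
text_db = {
    "dress text": [[1, 1, -1, -1], [1, -1, 1, -1]],      # same coarse and fine label
    "coat text":  [[1, 1, -1, -1], [-1, 1, 1, -1]],      # same coarse label only
    "boots text": [[-1, -1, 1, 1], [-1, 1, -1, 1]],      # different at both layers
}

sim = layer_similarity(("clothing", "dress"), ("clothing", "coat"))
ranking = rank(image_query, text_db)
print(sim)      # similar at the coarse layer, dissimilar at the fine layer
print(ranking)  # items sharing more layers with the query rank earlier
```

In this toy setup, an item that matches the query only at the coarse layer still ranks ahead of one that differs at every layer, which is the kind of graded, hierarchy-aware ordering that flat label supervision cannot express.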
