Self-supervised learning-based weight adaptive hashing for fast cross-modal retrieval

Owing to its low storage cost and fast search speed, hashing is widely used in cross-modal retrieval. However, several crucial bottlenecks remain. First, there are no large-scale datasets well suited to multimodal data. Second, imbalanced instances degrade the accuracy of the retrieval system. In this paper, we propose an end-to-end self-supervised learning-based weight-adaptive hashing method for cross-modal retrieval. To cope with the limitations of existing datasets, we adopt a self-supervised strategy that extracts fine-grained semantic features directly from the labels and uses them to supervise the hash learning of the other modalities. To overcome the problem of imbalanced instances, we design an adaptive weight loss that flexibly adjusts the weights of training samples according to their proportions. In addition, we employ a binary approximation regularization term to reduce the error introduced by relaxing the binary codes. Experiments on the MIRFLICKR-25K and NUS-WIDE datasets show that our method improves retrieval performance by about 3% compared with other methods.
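
The abstract does not give implementation details, but the adaptive weight loss and the binary approximation regularizer can be sketched as follows. This is a minimal illustration only: the PyTorch framework, the function and tensor names, and the inverse-proportion weighting scheme for similar/dissimilar pairs are assumptions, not the authors' released implementation.

```python
# Minimal sketch (assumptions: PyTorch, pairwise likelihood loss, inverse-proportion weights).
import torch
import torch.nn.functional as F

def adaptive_weight_hash_loss(h_img, h_txt, labels, reg_lambda=0.1):
    """h_img, h_txt: real-valued hash outputs (batch, bits); labels: multi-hot (batch, classes)."""
    # Pairwise semantic similarity: 1 if two samples share at least one label.
    sim = (labels @ labels.t() > 0).float()

    # Adaptive weights (assumed scheme): up-weight the rarer kind of pair
    # (similar vs. dissimilar) so imbalanced pairs contribute more evenly.
    n_pos = sim.sum().clamp(min=1.0)
    n_neg = (1.0 - sim).sum().clamp(min=1.0)
    weights = sim * (sim.numel() / (2.0 * n_pos)) + (1.0 - sim) * (sim.numel() / (2.0 * n_neg))

    # Weighted negative log-likelihood of pairwise similarity from inner products.
    inner = 0.5 * (h_img @ h_txt.t())
    pair_loss = (weights * (F.softplus(inner) - sim * inner)).mean()

    # Binary approximation regularizer: push relaxed codes toward {-1, +1}.
    reg = ((h_img.abs() - 1.0) ** 2).mean() + ((h_txt.abs() - 1.0) ** 2).mean()

    return pair_loss + reg_lambda * reg
```

In this sketch the regularizer penalizes the gap between the relaxed real-valued codes and their nearest binary values, which is one common way to keep the relaxation error small before taking the sign of the outputs at retrieval time.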
