A novel cross-modal hashing algorithm based on multimodal deep learning

With the growing popularity of multi-modal data on the Web, cross-media retrieval has become an active research topic. Existing cross-modal hashing methods assume that multi-modal features share a latent space, and embed the heterogeneous data into a joint abstraction space by linear projections. However, these approaches are sensitive to noise in the data, and cannot make use of unlabelled data or multi-modal data with missing values, both of which are common in real-world applications. To address these challenges, we propose a novel Multi-modal Deep Learning based Hashing (MDLH) algorithm. In particular, MDLH adopts a deep neural network to encode heterogeneous features into a compact common representation, and learns the hash functions from that common representation. The parameters of the whole model are fine-tuned in a supervised training stage. Experiments on two standard datasets show that our method achieves better cross-modal retrieval performance than the compared methods.
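The general idea of hashing two modalities into one shared binary space can be illustrated with a minimal sketch. It is not the MDLH architecture itself: the layer sizes, the tanh nonlinearity, the sign thresholding, and every name in it (`make_encoder`, `hash_code`, `rank_by_hamming`) are assumptions made for illustration, and the weights are random rather than pre-trained or fine-tuned. It only shows how modality-specific encoders can emit codes of equal length, so that a text query can be ranked against image codes by Hamming distance.

```python
import math
import random


def make_encoder(in_dim, hidden_dim, code_bits, rng):
    """Return random (untrained) weights for a hypothetical two-layer encoder."""
    w1 = [[rng.gauss(0, 1) / math.sqrt(in_dim) for _ in range(in_dim)]
          for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0, 1) / math.sqrt(hidden_dim) for _ in range(hidden_dim)]
          for _ in range(code_bits)]
    return w1, w2


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def encode(encoder, x):
    """Map a modality-specific feature vector to a real-valued common representation."""
    w1, w2 = encoder
    hidden = [math.tanh(v) for v in matvec(w1, x)]
    return matvec(w2, hidden)


def hash_code(encoder, x):
    """Binarize the common representation by thresholding at zero (an assumed hash function)."""
    return tuple(1 if v >= 0 else 0 for v in encode(encoder, x))


def hamming(a, b):
    """Count the bit positions where two codes differ."""
    return sum(x != y for x, y in zip(a, b))


def rank_by_hamming(query_code, database_codes):
    """Return database indices ordered from nearest to farthest code."""
    return sorted(range(len(database_codes)),
                  key=lambda i: hamming(query_code, database_codes[i]))


rng = random.Random(0)
CODE_BITS = 12  # assumed code length

# One encoder per modality; input sizes differ, code length is shared.
image_encoder = make_encoder(in_dim=8, hidden_dim=16, code_bits=CODE_BITS, rng=rng)
text_encoder = make_encoder(in_dim=5, hidden_dim=16, code_bits=CODE_BITS, rng=rng)

# Synthetic stand-ins for image and text features.
images = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
text_query = [rng.gauss(0, 1) for _ in range(5)]

db_codes = [hash_code(image_encoder, x) for x in images]
query_code = hash_code(text_encoder, text_query)
ranking = rank_by_hamming(query_code, db_codes)
print(ranking)
```

With untrained weights the ranking carries no semantic meaning. In a learned model, training would be expected to shape the encoders so that semantically related image and text items receive nearby codes.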
