A novel cross-modal hashing algorithm based on multimodal deep learning

With the growing popularity of multimodal data on the Web, cross-modal retrieval on large-scale multimedia databases has become an important research topic. Hashing-based cross-modal retrieval methods assume that multimodal features share a latent space. To model the relationship among heterogeneous data, most existing methods embed the data into a joint abstraction space by linear projections. However, these approaches are sensitive to noise in the data and cannot make use of unlabeled data or multimodal data with missing modalities, both of which are common in real-world applications. To address these challenges, we propose a novel multimodal deep-learning-based hash (MDLH) algorithm. In particular, MDLH uses a deep neural network to encode heterogeneous features into a compact common representation and learns the hash functions based on that common representation. The parameters of the whole model are fine-tuned in a supervised training stage. Experiments on two standard datasets show that the method achieves more effective cross-modal retrieval than the compared methods.
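The general pipeline described in the abstract (modality-specific encoders that map into a common representation, followed by binarization and Hamming-distance search) can be illustrated with a minimal sketch. This is not the MDLH implementation: the layer sizes, the `tanh` activation, the sign-threshold binarization, and every function name below are assumptions made for illustration, and the weights are random stand-ins for parameters that would be learned during training.

```python
import math
import random

def init_layer(in_dim, out_dim, rng):
    # Random weights stand in for parameters that training would learn.
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

def forward(layer, x):
    # One fully connected layer with a tanh activation (no bias, for brevity).
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in layer]

def encode(layers, x):
    # Pass a feature vector through a stack of layers into the common space.
    for layer in layers:
        x = forward(layer, x)
    return x

def binarize(representation):
    # Hypothetical hash function: one bit per dimension, thresholded at zero.
    return tuple(1 if v >= 0 else 0 for v in representation)

def hamming(a, b):
    # Number of bit positions in which two codes differ.
    return sum(x != y for x, y in zip(a, b))

def retrieve(query_code, database_codes, top_k):
    # Rank database items by Hamming distance to the query code.
    ranked = sorted(range(len(database_codes)),
                    key=lambda i: hamming(query_code, database_codes[i]))
    return ranked[:top_k]

# Illustrative dimensions only; the paper's actual sizes are not given here.
IMAGE_DIM, TEXT_DIM, HIDDEN_DIM, CODE_BITS = 8, 5, 6, 4
rng = random.Random(0)

# One encoder per modality, both ending in the same CODE_BITS-dimensional space.
image_encoder = [init_layer(IMAGE_DIM, HIDDEN_DIM, rng),
                 init_layer(HIDDEN_DIM, CODE_BITS, rng)]
text_encoder = [init_layer(TEXT_DIM, HIDDEN_DIM, rng),
                init_layer(HIDDEN_DIM, CODE_BITS, rng)]

# Synthetic image features form the database; a text feature is the query.
images = [[rng.random() for _ in range(IMAGE_DIM)] for _ in range(10)]
image_codes = [binarize(encode(image_encoder, img)) for img in images]

text_query = [rng.random() for _ in range(TEXT_DIM)]
query_code = binarize(encode(text_encoder, text_query))
nearest = retrieve(query_code, image_codes, top_k=3)
```

Because both encoders end in a code of the same length, a text query can be compared directly against image codes. With untrained weights the ranking is meaningless; the sketch only shows how data flows from heterogeneous features to comparable binary codes.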
