Cross-Media Semantic Matching via Sparse Neural Network Pre-trained by Deep Restricted Boltzmann Machines

Cross-media retrieval has attracted considerable attention and is becoming an increasingly worthwhile research direction in information retrieval. Unlike many related works, which perform retrieval by mapping heterogeneous data into a common representation subspace using a pair of projection matrices, we feed multi-modal media data into a deep sparse neural network pre-trained by restricted Boltzmann machines, which outputs their semantic understanding for semantic matching (RSNN-SM). Consequently, data from heterogeneous modalities are represented by their top-level semantic outputs, and cross-media retrieval is performed by measuring their semantic similarities. Experimental results on several real-world datasets show that RSNN-SM achieves the best performance, outperforming state-of-the-art approaches.
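The retrieval step described above, ranking items of one modality by the similarity of their semantic outputs to a query from another modality, can be illustrated with a minimal sketch. The abstract does not state which similarity measure is used, so cosine similarity is an assumption here. The three-class semantic vectors and the function names (`cosine_similarity`, `rank_by_semantic_similarity`) are also hypothetical, chosen only for illustration. They stand in for the top-level outputs of the trained networks, which are not reproduced.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length semantic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_u * norm_v)


def rank_by_semantic_similarity(query, gallery):
    """Return (gallery index, score) pairs, most similar first."""
    scores = [(idx, cosine_similarity(query, item))
              for idx, item in enumerate(gallery)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical semantic outputs (class probabilities over three classes).
# In the described setting these would come from the per-modality networks.
image_query = [0.8, 0.1, 0.1]
text_gallery = [
    [0.1, 0.7, 0.2],
    [0.7, 0.2, 0.1],
    [0.2, 0.2, 0.6],
]

ranking = rank_by_semantic_similarity(image_query, text_gallery)
print([idx for idx, _ in ranking])  # prints [1, 2, 0]
```

Because both modalities are compared in the same semantic (class) space, no shared projection subspace is needed at retrieval time; only the semantic vectors are compared.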
