Deep Learning and Shared Representation Space Learning Based Cross-Modal Multimedia Retrieval

An increasing amount of multimedia content on the Internet, including text, audio, video, and images, is used jointly to describe the same semantic concept. This paper presents a new method for more efficient cross-modal multimedia retrieval. Taking images and text as an example, we learn deep features for images with a convolutional neural network and text features with a latent Dirichlet allocation model. We then map the two feature spaces into a shared representation space with a probabilistic model so that they become isomorphic. Finally, we adopt centered correlation to measure the distance between the two representations. Experimental results on the Wikipedia dataset show that our approach achieves state-of-the-art results.
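As an illustration of the pipeline described above, the following is a minimal Python sketch, not the paper's implementation: it assumes image features come from some pretrained CNN (4096-dimensional vectors are used here only as a placeholder), uses scikit-learn's LatentDirichletAllocation for the text topic features, and stands in canonical correlation analysis (CCA) for the paper's probabilistic mapping into the shared representation space. Retrieval is then ranked by centered correlation.

# Minimal sketch of the cross-modal retrieval pipeline (not the paper's code).
# Assumptions: image features from any pretrained CNN; LDA topic proportions
# as text features; CCA as a stand-in for the probabilistic shared-space model.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cross_decomposition import CCA

def centered_correlation(a, b):
    """Correlation between two vectors after subtracting their means."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Placeholder paired data: N image-text pairs with CNN features and word counts.
N, img_dim, vocab = 200, 4096, 1000
rng = np.random.default_rng(0)
img_feats = rng.standard_normal((N, img_dim))
word_counts = rng.integers(0, 5, size=(N, vocab))

# Text features: per-document topic proportions from LDA.
lda = LatentDirichletAllocation(n_components=20, random_state=0)
txt_feats = lda.fit_transform(word_counts)

# Shared representation space: project both modalities to a common dimension.
cca = CCA(n_components=10)
img_shared, txt_shared = cca.fit_transform(img_feats, txt_feats)

# Cross-modal retrieval: rank texts for a query image by centered correlation.
query = img_shared[0]
scores = [centered_correlation(query, t) for t in txt_shared]
ranking = np.argsort(scores)[::-1]
print("top-5 retrieved text indices:", ranking[:5])

In practice the random arrays would be replaced by real paired image-text features, for example CNN activations and word counts extracted from the Wikipedia dataset mentioned above.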
