Deep Semantic Mapping for Cross-Modal Retrieval

Cross-modal mapping plays an essential role in multimedia information retrieval systems. However, most existing work has focused on learning mapping functions while neglecting high-level semantic representations of the individual modalities. Inspired by the recent success of deep learning, this paper uses deep convolutional neural network (CNN) features and topic features as visual and textual semantic representations, respectively. To capture the highly non-linear semantic correlation between image and text, we propose a regularized deep neural network (RE-DNN) for semantic mapping across modalities. By imposing intra-modal regularization as supervised pre-training, we learn a joint model that captures both intra-modal and inter-modal relationships. Our approach improves on previous work in three respects: (1) it explores high-level semantic correlations, (2) it requires little prior knowledge for model training, and (3) it can handle the missing-modality problem. Extensive experiments on the benchmark Wikipedia dataset show that RE-DNN outperforms state-of-the-art approaches in cross-modal retrieval.
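To make the retrieval setup concrete, the following is a minimal sketch (not the paper's RE-DNN) of the general idea: two modality-specific branches map CNN image features and topic-model text features into a shared semantic space, where retrieval reduces to nearest-neighbor search under cosine similarity. All layer sizes, the two-layer branch design, and the random features are illustrative assumptions; the paper's regularization terms and supervised pre-training are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    return np.maximum(0.0, x)


class ModalityBranch:
    """One sub-network mapping a modality-specific feature vector
    (e.g. a 4096-d CNN descriptor or a 100-d topic distribution)
    into a shared semantic space. Untrained, for illustration only."""

    def __init__(self, in_dim, hidden_dim, out_dim):
        self.W1 = rng.normal(0.0, 0.01, (in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(0.0, 0.01, (hidden_dim, out_dim))
        self.b2 = np.zeros(out_dim)

    def forward(self, x):
        return relu(x @ self.W1 + self.b1) @ self.W2 + self.b2


def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return a @ b.T


# Two branches project into the same 10-d joint semantic space.
img_branch = ModalityBranch(in_dim=4096, hidden_dim=256, out_dim=10)
txt_branch = ModalityBranch(in_dim=100, hidden_dim=256, out_dim=10)

img_feats = rng.normal(size=(5, 4096))  # stand-in CNN features, 5 images
txt_feats = rng.normal(size=(5, 100))   # stand-in topic features, 5 texts

# Image-query-text retrieval: embed both modalities, then rank the
# texts for each image by similarity in the joint space.
scores = cosine_sim(img_branch.forward(img_feats),
                    txt_branch.forward(txt_feats))
ranking = np.argsort(-scores[0])  # text indices ranked for image 0
```

In a trained model, the branch weights would be learned so that matching image-text pairs land close together in the joint space; the sketch only shows why a shared output dimensionality lets a single similarity function compare inputs from different modalities.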
