Cross-Modal Retrieval via Deep and Bidirectional Representation Learning

Cross-modal retrieval hinges on understanding inter-modality semantic correlations, which is typically achieved by designing a similarity function. A central requirement of such a function is that it make similarity across modalities computable in the first place. In this paper, a deep and bidirectional representation learning model is proposed to address image-text cross-modal retrieval. Given the solid progress of deep learning in computer vision and natural language processing, it is reliable to extract semantic representations from raw image and text data with deep neural networks. Accordingly, the proposed model adopts two convolution-based networks to learn representations for images and texts. Passing through these networks, images and texts are mapped into a common space, where cross-modal similarity is measured by cosine distance. A bidirectional network architecture is then designed to capture the defining property of cross-modal retrieval: bidirectional search. This architecture is characterized by simultaneously involving matched and unmatched image-text pairs during training. On this basis, a learning framework with a maximum likelihood criterion is developed, and the network parameters are optimized via backpropagation and stochastic gradient descent. Extensive experiments are conducted to evaluate the proposed method on three publicly released datasets: IAPR TC-12, Flickr30k, and Flickr8k. The overall results show that the proposed architecture is effective and that the learned representations carry good semantics, achieving superior cross-modal retrieval performance.
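The core mechanism the abstract describes, projecting both modalities into a common space, scoring pairs by cosine similarity, and maximizing the likelihood of matched pairs against unmatched ones, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding vectors stand in for the outputs of the two convolutional branches, and the softmax-over-similarities likelihood is one plausible instantiation of the maximum likelihood criterion.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors in the common space."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, standing in for the outputs of the image
# network and the text network after projection into a shared d-dim space.
rng = np.random.default_rng(0)
d = 8
img_emb = rng.normal(size=d)
txt_matched = img_emb + 0.1 * rng.normal(size=d)  # text describing this image
txt_unmatched = rng.normal(size=d)                # unrelated text

s_pos = cosine_similarity(img_emb, txt_matched)
s_neg = cosine_similarity(img_emb, txt_unmatched)

def match_probability(s_match, s_others):
    """Softmax over similarities: probability that the matched candidate
    is ranked first among all candidates (one direction of the
    bidirectional search; swapping roles of image and text gives the other)."""
    sims = np.array([s_match] + list(s_others))
    exps = np.exp(sims - sims.max())  # stabilized softmax
    return float(exps[0] / exps.sum())

p = match_probability(s_pos, [s_neg])
```

Training would maximize `p` (equivalently, minimize its negative log) over both retrieval directions, pulling matched image-text pairs together and pushing unmatched pairs apart in the common space.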
