Modality-Specific Cross-Modal Similarity Measurement With Recurrent Attention Network

Nowadays, cross-modal retrieval plays an important role in flexibly finding useful information across different modalities of data. Effectively measuring the similarity between different modalities is the key to cross-modal retrieval. Different modalities, such as image and text, have an imbalanced and complementary relationship: they contain unequal amounts of information when describing the same semantics. For example, images often contain details that cannot be demonstrated by textual descriptions, and vice versa. Existing works based on deep neural networks mostly construct one common space for different modalities to find the latent alignments between them, which loses their exclusive modality-specific characteristics. Therefore, we propose a modality-specific cross-modal similarity measurement approach that constructs an independent semantic space for each modality and adopts an end-to-end framework to directly generate the modality-specific cross-modal similarity without an explicit common representation. In each semantic space, the modality-specific characteristics within one modality are fully exploited by a recurrent attention network, while the data of the other modality are projected into this space with attention-based joint embedding, which utilizes the learned attention weights to guide fine-grained cross-modal correlation learning and captures the imbalanced and complementary relationship between different modalities. Finally, the complementarity between the semantic spaces for different modalities is exploited by adaptively fusing the modality-specific cross-modal similarities to perform cross-modal retrieval. Experiments on the widely used Wikipedia, Pascal Sentence, and MS-COCO data sets, as well as our constructed large-scale XMediaNet data set, verify the effectiveness of our proposed approach, which outperforms nine state-of-the-art methods.
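The final fusion step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names and the toy similarity values are hypothetical, and the fusion weights here are fixed scalars passed through a softmax, whereas in the actual approach they would be learned end-to-end together with the two semantic spaces.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_similarities(sim_img_space, sim_txt_space, fusion_logits):
    """Adaptively fuse two modality-specific cross-modal similarities.

    sim_img_space / sim_txt_space: (n_queries, n_targets) similarity
    matrices, each computed in one modality-specific semantic space.
    fusion_logits: scalars (learnable in the real approach) turned
    into convex combination weights via softmax.
    """
    w = softmax(fusion_logits)
    return w[0] * sim_img_space + w[1] * sim_txt_space

# Toy similarities between 2 image queries and 3 text targets
# (values are illustrative, not from any data set).
sim_img = np.array([[0.9, 0.2, 0.1],
                    [0.3, 0.8, 0.4]])
sim_txt = np.array([[0.7, 0.4, 0.2],
                    [0.1, 0.9, 0.5]])

# Equal logits -> weights [0.5, 0.5]; fused similarity drives retrieval.
fused = fuse_similarities(sim_img, sim_txt, np.array([0.0, 0.0]))
ranking = np.argsort(-fused, axis=1)  # rank targets per query, best first
```

Because the weights form a convex combination, the fused score stays on the same scale as the per-space similarities, and skew in the logits lets the model lean on whichever semantic space is more reliable for a given pair of modalities.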
