Reinforced Cross-Media Correlation Learning by Context-Aware Bidirectional Translation

The heterogeneity gap leads to inconsistent distributions and representations of image and text, which makes it challenging to measure their similarity and construct cross-media correlation between them. Existing works mainly model cross-media correlation in a common subspace, where the intermediate, unidirectional transformation into such a third-party subspace results in insufficient correlation modeling. Inspired by recent advances in neural machine translation, which establishes a corresponding relationship between two entirely different languages, we observe a striking common characteristic with cross-media correlation learning: image and text can be considered as bilingual pairs, with the image treated as a special kind of language that provides a visual description, so that bidirectional transformation can be conducted between image and text to effectively explore cross-media correlation in the feature space of each media type. Thus, we propose a reinforced cross-media bidirectional translation (RCBT) approach to model the correlation between visual and textual descriptions. First, a cross-media bidirectional translation mechanism is proposed to conduct direct transformation between the bilingual pairs of visual and textual descriptions in both directions, so that cross-media correlation can be effectively captured in the feature spaces of both image and text through bidirectional translation training. Second, a cross-media context-aware network with residual attention is proposed to exploit rich spatial and temporal context hints with a cross-media convolutional recurrent neural network, which leads to more precise correlation learning and promotes the bidirectional translation process. Third, cross-media reinforcement learning is proposed to perform a two-agent communication game, played as rounds between image and text, to boost the bidirectional translation process; we further extract inter-media and intra-media reward signals to provide complementary clues for learning cross-media correlation. Extensive experiments on cross-media retrieval, comparing against 11 state-of-the-art methods on three cross-media datasets, verify the effectiveness of the proposed RCBT approach.
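
To make the bidirectional translation idea concrete, the following is a minimal PyTorch sketch (our own illustration, not the authors' released code) of two encoder-decoder branches that translate between image-region and word feature sequences in both directions, so that correlation is learned in the feature space of each media type. All module names, dimensions, and the teacher-forced MSE objective are illustrative assumptions.

```python
# Hypothetical sketch of bidirectional image<->text translation.
# Image regions and text words are both treated as sequences of
# feature vectors, i.e., as "bilingual pairs".
import torch
import torch.nn as nn

class TranslationBranch(nn.Module):
    """Encoder-decoder translating a source feature sequence into a
    target feature sequence (one direction of the bilingual pair)."""
    def __init__(self, src_dim, tgt_dim, hidden=512):
        super().__init__()
        self.encoder = nn.GRU(src_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(tgt_dim, hidden, batch_first=True)
        self.project = nn.Linear(hidden, tgt_dim)

    def forward(self, src_seq, tgt_seq):
        # Encode the source; its final hidden state conditions the
        # decoder, so the mapping lands in the target's feature space.
        _, h = self.encoder(src_seq)
        out, _ = self.decoder(tgt_seq, h)  # teacher-forced decoding
        return self.project(out)

class BidirectionalTranslator(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300):
        super().__init__()
        self.img2txt = TranslationBranch(img_dim, txt_dim)
        self.txt2img = TranslationBranch(txt_dim, img_dim)

    def forward(self, img_seq, txt_seq):
        pred_txt = self.img2txt(img_seq, txt_seq)  # image -> text space
        pred_img = self.txt2img(txt_seq, img_seq)  # text -> image space
        # Translation losses in both feature spaces, summed, so the
        # correlation is captured bidirectionally during training.
        return (nn.functional.mse_loss(pred_txt, txt_seq)
                + nn.functional.mse_loss(pred_img, img_seq))
```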
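
The context-aware component can similarly be sketched as a residual-attention mask over a convolutional feature map, followed by a recurrent scan of the attended spatial grid. The structure below is a hypothetical reading of "convolutional recurrent neural network with residual attention", not the paper's exact architecture; the residual form (1 + mask) * features follows the common convention that attention refines rather than replaces the original activations.

```python
# Hypothetical context-aware encoder with residual attention.
import torch
import torch.nn as nn

class ResidualAttention(nn.Module):
    """Residual attention over a conv feature map:
    output = (1 + mask) * features."""
    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return (1.0 + self.mask(x)) * x

class ContextAwareEncoder(nn.Module):
    """Convolutional recurrent encoder: spatial context from the conv
    feature map, sequential context from a GRU scanned over the
    attended spatial grid."""
    def __init__(self, channels=512, hidden=512):
        super().__init__()
        self.attend = ResidualAttention(channels)
        self.rnn = nn.GRU(channels, hidden, batch_first=True)

    def forward(self, fmap):                       # fmap: (B, C, H, W)
        x = self.attend(fmap)
        b, c, h, w = x.shape
        seq = x.view(b, c, h * w).transpose(1, 2)  # (B, H*W, C)
        _, last = self.rnn(seq)                    # scan grid cells
        return last.squeeze(0)                     # context-aware embedding
```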
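
For the reinforcement-learning component, a REINFORCE-style policy gradient with a combined reward is one plausible instantiation of the two-agent communication game. The inter-media and intra-media reward definitions below (cosine similarities of the translated result against the paired sample, and of a round-trip reconstruction against the original) are assumptions for illustration only.

```python
# Hypothetical REINFORCE-style objective for the two-agent translation
# game; reward definitions are illustrative, not the paper's exact ones.
import torch
import torch.nn.functional as F

def inter_media_reward(translated, paired):
    # Inter-media clue: does the image -> text translation match the
    # ground-truth paired text embedding? (B, T, D) -> (B,)
    return F.cosine_similarity(translated.mean(1), paired.mean(1))

def intra_media_reward(reconstructed, original):
    # Intra-media clue: after a full round (image -> text -> image),
    # does the reconstruction still resemble the original features?
    return F.cosine_similarity(reconstructed.mean(1), original.mean(1))

def policy_gradient_loss(log_probs, translated, paired,
                         reconstructed, original, baseline=0.0):
    # log_probs: (B, T) log-probabilities of the sampled translation
    # actions. The two reward signals provide complementary clues.
    reward = (inter_media_reward(translated, paired)
              + intra_media_reward(reconstructed, original))
    advantage = reward - baseline           # simple variance reduction
    return -(advantage.detach().unsqueeze(1) * log_probs).mean()
```

One round of the game would then proceed as: the image agent translates to the text space, the text agent translates back, and both agents are updated from the shared reward signal.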
