Reinforced Cross-Media Correlation Learning by Context-Aware Bidirectional Translation

The heterogeneity gap leads to inconsistent distributions and representations of image and text, which makes it challenging to measure their similarity and construct cross-media correlation between them. Existing works mainly model cross-media correlation in a common subspace, where the intermediate, unidirectional transformation into such a third-party subspace results in insufficient correlation modeling. Inspired by recent advances in neural machine translation, which establishes a corresponding relationship between two entirely different languages, we observe a striking common characteristic with cross-media correlation learning: image and text can be considered as bilingual pairs, with the image treated as a special kind of language that provides a visual description, so that bidirectional transformation can be conducted between image and text to effectively explore cross-media correlation in the feature space of each media type. Thus, we propose a reinforced cross-media bidirectional translation (RCBT) approach to model the correlation between visual and textual descriptions. First, a cross-media bidirectional translation mechanism is proposed to conduct direct transformation between the bilingual pairs of visual and textual descriptions in both directions, so that cross-media correlation can be effectively captured in the feature spaces of both image and text through bidirectional translation training. Second, a cross-media context-aware network with residual attention is proposed to exploit rich spatial and temporal context hints with a cross-media convolutional recurrent neural network, which leads to more precise correlation learning and promotes the bidirectional translation process. Third, cross-media reinforcement learning is proposed to perform a two-agent communication game, played as rounds between image and text, to boost the bidirectional translation process; we further extract inter-media and intra-media reward signals to provide complementary clues for learning cross-media correlation. Extensive experiments on cross-media retrieval, comparing against 11 state-of-the-art methods on three cross-media datasets, verify the effectiveness of the proposed RCBT approach.
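
To make the bidirectional translation idea concrete, the following is a minimal PyTorch sketch (our own illustration, not the authors' released code) of two encoder-decoder branches that translate between image-region and word feature sequences in both directions, so that correlation is learned in the feature space of each media type. All module names, dimensions, and the teacher-forced MSE objective are illustrative assumptions.

```python
# Hypothetical sketch of bidirectional image<->text translation.
# Image regions and text words are both treated as sequences of
# feature vectors, i.e., as "bilingual pairs".
import torch
import torch.nn as nn

class TranslationBranch(nn.Module):
    """Encoder-decoder translating a source feature sequence into a
    target feature sequence (one direction of the bilingual pair)."""
    def __init__(self, src_dim, tgt_dim, hidden=512):
        super().__init__()
        self.encoder = nn.GRU(src_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(tgt_dim, hidden, batch_first=True)
        self.project = nn.Linear(hidden, tgt_dim)

    def forward(self, src_seq, tgt_seq):
        # Encode the source; its final hidden state conditions the
        # decoder, so the mapping lands in the target's feature space.
        _, h = self.encoder(src_seq)
        out, _ = self.decoder(tgt_seq, h)  # teacher-forced decoding
        return self.project(out)

class BidirectionalTranslator(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300):
        super().__init__()
        self.img2txt = TranslationBranch(img_dim, txt_dim)
        self.txt2img = TranslationBranch(txt_dim, img_dim)

    def forward(self, img_seq, txt_seq):
        pred_txt = self.img2txt(img_seq, txt_seq)  # image -> text space
        pred_img = self.txt2img(txt_seq, img_seq)  # text -> image space
        # Translation losses in both feature spaces, summed, so the
        # correlation is captured bidirectionally during training.
        return (nn.functional.mse_loss(pred_txt, txt_seq)
                + nn.functional.mse_loss(pred_img, img_seq))
```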
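
The context-aware component can similarly be sketched as a residual-attention mask over a convolutional feature map, followed by a recurrent scan of the attended spatial grid. The structure below is a hypothetical reading of "convolutional recurrent neural network with residual attention", not the paper's exact architecture; the residual form (1 + mask) * features follows the common convention that attention refines rather than replaces the original activations.

```python
# Hypothetical context-aware encoder with residual attention.
import torch
import torch.nn as nn

class ResidualAttention(nn.Module):
    """Residual attention over a conv feature map:
    output = (1 + mask) * features."""
    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return (1.0 + self.mask(x)) * x

class ContextAwareEncoder(nn.Module):
    """Convolutional recurrent encoder: spatial context from the conv
    feature map, sequential context from a GRU scanned over the
    attended spatial grid."""
    def __init__(self, channels=512, hidden=512):
        super().__init__()
        self.attend = ResidualAttention(channels)
        self.rnn = nn.GRU(channels, hidden, batch_first=True)

    def forward(self, fmap):                       # fmap: (B, C, H, W)
        x = self.attend(fmap)
        b, c, h, w = x.shape
        seq = x.view(b, c, h * w).transpose(1, 2)  # (B, H*W, C)
        _, last = self.rnn(seq)                    # scan grid cells
        return last.squeeze(0)                     # context-aware embedding
```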
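
For the reinforcement-learning component, a REINFORCE-style policy gradient with a combined reward is one plausible instantiation of the two-agent communication game. The inter-media and intra-media reward definitions below (cosine similarities of the translated result against the paired sample, and of a round-trip reconstruction against the original) are assumptions for illustration only.

```python
# Hypothetical REINFORCE-style objective for the two-agent translation
# game; reward definitions are illustrative, not the paper's exact ones.
import torch
import torch.nn.functional as F

def inter_media_reward(translated, paired):
    # Inter-media clue: does the image -> text translation match the
    # ground-truth paired text embedding? (B, T, D) -> (B,)
    return F.cosine_similarity(translated.mean(1), paired.mean(1))

def intra_media_reward(reconstructed, original):
    # Intra-media clue: after a full round (image -> text -> image),
    # does the reconstruction still resemble the original features?
    return F.cosine_similarity(reconstructed.mean(1), original.mean(1))

def policy_gradient_loss(log_probs, translated, paired,
                         reconstructed, original, baseline=0.0):
    # log_probs: (B, T) log-probabilities of the sampled translation
    # actions. The two reward signals provide complementary clues.
    reward = (inter_media_reward(translated, paired)
              + intra_media_reward(reconstructed, original))
    advantage = reward - baseline           # simple variance reduction
    return -(advantage.detach().unsqueeze(1) * log_probs).mean()
```

One round of the game would then proceed as: the image agent translates to the text space, the text agent translates back, and both agents are updated from the shared reward signal.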
