CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning

It is known that the inconsistent distributions and representations of different modalities, such as image and text, cause the heterogeneity gap, which makes it very challenging to correlate such heterogeneous data. Recently, generative adversarial networks (GANs) have been proposed and have shown a strong ability to model data distributions and learn discriminative representations, yet most existing GAN-based works focus on unidirectional generation problems, producing new data such as synthesized images. Our goal is completely different: we aim to effectively correlate existing large-scale heterogeneous data of different modalities by utilizing the power of GANs to model the cross-modal joint distribution. Thus, in this paper we propose Cross-modal Generative Adversarial Networks (CM-GANs) to learn a discriminative common representation that bridges the heterogeneity gap. The main contributions can be summarized as follows. (1) A cross-modal GAN architecture is proposed to model the joint distribution over the data of different modalities, so that inter-modality and intra-modality correlation can be explored simultaneously; the generative and discriminative models compete with each other to promote cross-modal correlation learning. (2) Cross-modal convolutional autoencoders with a weight-sharing constraint are proposed to form the generative model. They not only exploit cross-modal correlation to learn the common representation, but also preserve reconstruction information to capture the semantic consistency within each modality. (3) A cross-modal adversarial mechanism is proposed that utilizes two kinds of discriminative models to conduct intra-modality and inter-modality discrimination simultaneously; through adversarial training, they mutually boost each other to make the generated common representation more discriminative. To the best of our knowledge, our proposed CM-GANs approach is the first to utilize GANs for cross-modal common representation learning, by which heterogeneous data can be effectively correlated. Extensive experiments on the cross-modal retrieval paradigm verify the performance of our approach against 10 state-of-the-art methods on 3 cross-modal datasets.
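To make the described architecture concrete, below is a minimal PyTorch-style sketch of the adversarial setup the abstract outlines. It is NOT the authors' implementation: fully connected layers stand in for the convolutional autoencoders, only the inter-modality discriminator is shown, and every dimension, module name, and hyperparameter is an illustrative assumption.

```python
# Minimal sketch of the CM-GANs idea (assumed names and sizes throughout).
import torch
import torch.nn as nn

IMG_DIM, TXT_DIM, COMMON_DIM = 4096, 300, 64  # assumed feature sizes

class CMGenerator(nn.Module):
    """Two modality-specific autoencoders whose encoders end in the SAME
    linear layer: the weight-sharing constraint that ties both branches
    to one common representation space."""
    def __init__(self):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Linear(IMG_DIM, 256), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(TXT_DIM, 256), nn.ReLU())
        self.shared = nn.Linear(256, COMMON_DIM)  # shared by both modalities
        self.img_dec = nn.Sequential(nn.Linear(COMMON_DIM, 256), nn.ReLU(),
                                     nn.Linear(256, IMG_DIM))
        self.txt_dec = nn.Sequential(nn.Linear(COMMON_DIM, 256), nn.ReLU(),
                                     nn.Linear(256, TXT_DIM))

    def forward(self, img, txt):
        c_img = self.shared(self.img_enc(img))  # common code from image
        c_txt = self.shared(self.txt_enc(txt))  # common code from text
        return c_img, c_txt, self.img_dec(c_img), self.txt_dec(c_txt)

# Inter-modality discriminator: guesses which modality a common code came
# from; the generator learns to make the two distributions indistinguishable.
d_inter = nn.Sequential(nn.Linear(COMMON_DIM, 64), nn.ReLU(),
                        nn.Linear(64, 1), nn.Sigmoid())

gen = CMGenerator()
bce, mse = nn.BCELoss(), nn.MSELoss()
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(d_inter.parameters(), lr=1e-4)

img = torch.randn(8, IMG_DIM)  # toy paired batch of image/text features
txt = torch.randn(8, TXT_DIM)
ones, zeros = torch.ones(8, 1), torch.zeros(8, 1)

for step in range(100):
    c_img, c_txt, rec_img, rec_txt = gen(img, txt)

    # Discriminator step: label image-side codes 1 and text-side codes 0.
    opt_d.zero_grad()
    d_loss = (bce(d_inter(c_img.detach()), ones)
              + bce(d_inter(c_txt.detach()), zeros))
    d_loss.backward()
    opt_d.step()

    # Generator step: fool the discriminator (swapped labels) while also
    # preserving reconstruction within each modality (semantic consistency).
    opt_g.zero_grad()
    adv = bce(d_inter(c_img), zeros) + bce(d_inter(c_txt), ones)
    rec = mse(rec_img, img) + mse(rec_txt, txt)
    (adv + rec).backward()
    opt_g.step()
```

The sketch keeps the two-force objective visible: the adversarial term pushes the image and text codes toward a modality-indistinguishable joint distribution, while the reconstruction term keeps each code faithful to its own modality. The paper's intra-modality discriminators, which additionally judge data within each modality, would slot in beside d_inter in the same training loop.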
