论文信息 - Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings

Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings

Learned joint representations of images and text form the backbone of several important cross-domain tasks such as image captioning. Prior work mostly maps both domains into a common latent representation in a purely supervised fashion. This is rather restrictive, however, as the two domains follow distinct generative processes. Therefore, we propose a novel semi-supervised framework, which models shared information between domains and domain-specific information separately. The information shared between the domains is aligned with an invertible neural network. Our model integrates normalizing flow-based priors for the domain-specific information, which allows us to learn diverse many-to-many mappings between the two domains. We demonstrate the effectiveness of our model on diverse tasks, including image captioning and text-to-image synthesis.

[1] Alon Lavie,et al. Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.

[2] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[3] Zhe Gan,et al. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4] Prafulla Dhariwal,et al. Glow: Generative Flow with Invertible 1x1 Convolutions , 2018, NeurIPS.

[5] Geoffrey E. Hinton,et al. Visualizing Data using t-SNE , 2008 .

[6] Sanja Fidler,et al. Towards Diverse and Natural Image Descriptions via a Conditional GAN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[7] Wei Xu,et al. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[8] Samy Bengio,et al. Density estimation using Real NVP , 2016, ICLR.

[9] Ali Razavi,et al. Preventing Posterior Collapse with delta-VAEs , 2019, ICLR.

[10] Max Welling,et al. VAE with a VampPrior , 2017, AISTATS.

[11] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[12] Daniel McDuff,et al. M3D-GAN: Multi-Modal Multi-Domain Translation with Universal Attention , 2019, ArXiv.

[13] Gang Wang,et al. Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14] Antoni B. Chan,et al. Describing Like Humans: On Diversity in Image Captioning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Alexei A. Efros,et al. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16] Ashwin K. Vijayakumar,et al. Diverse Beam Search for Improved Description of Complex Scenes , 2018, AAAI.

[17] Alexander G. Schwing,et al. Convolutional Image Captioning , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18] Han Zhang,et al. Self-Attention Generative Adversarial Networks , 2018, ICML.

[19] Lin Yang,et al. Photographic Text-to-Image Synthesis with a Hierarchically-Nested Adversarial Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20] Bernt Schiele,et al. Conditional Flow Variational Autoencoders for Structured Sequence Prediction , 2019, ArXiv.

[21] Chin-Yew Lin,et al. ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[22] David Duvenaud,et al. Invertible Residual Networks , 2018, ICML.

[23] Jing Zhang,et al. MirrorGAN: Learning Text-To-Image Generation by Redescription , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24] Bernt Schiele,et al. Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[25] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26] Charles A. Sutton,et al. VEEGAN: Reducing Mode Collapse in GANs using Implicit Variational Learning , 2017, NIPS.

[27] Marcus Liwicki,et al. TAC-GAN - Text Conditioned Auxiliary Classifier Generative Adversarial Network , 2017, ArXiv.

[28] Yoshua Bengio,et al. NICE: Non-linear Independent Components Estimation , 2014, ICLR.

[29] Fei-Fei Li,et al. Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Liwei Wang,et al. Learning Two-Branch Neural Networks for Image-Text Matching Tasks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31] Xiaogang Wang,et al. StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32] John E. Hopcroft,et al. Stacked Generative Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Alexander M. Rush,et al. Latent Normalizing Flows for Discrete Sequences , 2019, ICML.

[34] Dimitris N. Metaxas,et al. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[35] Ullrich Köthe,et al. Analyzing Inverse Problems with Invertible Neural Networks , 2018, ICLR.

[36] Dhruv Batra,et al. Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[37] Yueting Zhuang,et al. Diverse Image Captioning via GroupTalk , 2016, IJCAI.

[38] Svetlana Lazebnik,et al. Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space , 2017, NIPS.

[39] Bernt Schiele,et al. Generative Adversarial Text to Image Synthesis , 2016, ICML.

[40] Alexander Schwing,et al. Fast, Diverse and Accurate Image Captioning Guided by Part-Of-Speech , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41] Oriol Vinyals,et al. Neural Discrete Representation Learning , 2017, NIPS.

[42] Max Welling,et al. Auto-Encoding Variational Bayes , 2013, ICLR.

[43] Basura Fernando,et al. SPICE: Semantic Propositional Image Caption Evaluation , 2016, ECCV.

[44] Fei-Fei Li,et al. Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[45] Yoshua Bengio,et al. Generative Adversarial Nets , 2014, NIPS.

[46] Pieter Abbeel,et al. Variational Lossy Autoencoder , 2016, ICLR.

[47] Nenghai Yu,et al. Semantics Disentangling for Text-To-Image Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48] Lei Zhang,et al. Generating Diverse and Accurate Visual Captions by Comparative Adversarial Learning , 2018, ArXiv.

[49] Wojciech Zaremba,et al. Improved Techniques for Training GANs , 2016, NIPS.