Paper Information - Generative Models of Visually Grounded Imagination

Generative Models of Visually Grounded Imagination

It is easy for people to imagine what a man with pink hair looks like, even if they have never seen such a person before. We call the ability to create images of novel semantic concepts visually grounded imagination. In this paper, we show how we can modify variational auto-encoders to perform this task. Our method uses a novel training objective, and a novel product-of-experts inference network, which can handle partially specified (abstract) concepts in a principled and efficient way. We also propose a set of easy-to-compute evaluation metrics that capture our intuitive notions of what it means to have good visual imagination, namely correctness, coverage, and compositionality (the 3 C's). Finally, we perform a detailed comparison of our method with two existing joint image-attribute VAE methods (the JMVAE method of Suzuki et al. and the BiVCCA method of Wang et al.) by applying them to two datasets: the MNIST-with-attributes dataset (which we introduce here), and the CelebA dataset.
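To make the product-of-experts idea in the abstract concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each attribute has its own diagonal Gaussian expert over the latent code, that the prior is a standard normal, and that unspecified attributes are simply left out of the product. The function name `product_of_gaussian_experts` and its arguments are hypothetical and chosen for illustration only.

```python
import numpy as np

def product_of_gaussian_experts(means, variances, observed):
    """Combine per-attribute diagonal Gaussian experts with a N(0, I) prior.

    Assumptions (not taken from the paper's code):
      means, variances: shape (num_attributes, latent_dim)
      observed: boolean mask of shape (num_attributes,);
                False marks an attribute that is left unspecified.
    Returns the mean and variance of the normalized product.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    observed = np.asarray(observed, dtype=bool)

    # Precision (inverse variance) of each observed expert.
    precisions = 1.0 / variances[observed]
    # The prior expert N(0, I) contributes precision 1 and mean 0.
    total_precision = 1.0 + precisions.sum(axis=0)
    posterior_var = 1.0 / total_precision
    # Precision-weighted average of the expert means (prior mean is 0).
    posterior_mean = posterior_var * (precisions * means[observed]).sum(axis=0)
    return posterior_mean, posterior_var

# Two attribute experts in a 1-D latent space; the second is left unspecified.
mean, var = product_of_gaussian_experts(
    means=[[2.0], [4.0]], variances=[[1.0], [1.0]], observed=[True, False]
)
print(mean, var)
```

Under these assumptions, specifying fewer attributes leaves fewer experts in the product, so the resulting distribution stays broader; with no attributes specified it reduces to the prior.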

[1] Harlene Hayne, et al. Pigeons on Par with Primates in Numerical Competence, 2011, Science.

[2] Wojciech Zaremba, et al. Improved Techniques for Training GANs, 2016, NIPS.

[3] Wei-Lun Chao, et al. An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild, 2016, ECCV.

[4] Subhransu Maji, et al. A Taxonomy of Part and Attribute Discovery Techniques, 2017.

[5] J. Tenenbaum. A Bayesian framework for concept learning, 1999.

[6] Timothy M. Hospedales, et al. Gaussian Visual-Linguistic Embedding for Zero-Shot Recognition, 2016, EMNLP.

[7] Yann LeCun, et al. Disentangling factors of variation in deep representation using adversarial training, 2016, NIPS.

[8] Honglak Lee, et al. Attribute2Image: Conditional Image Generation from Visual Attributes, 2015, ECCV.

[9] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[10] Christoph H. Lampert, et al. Attribute-Based Classification for Zero-Shot Visual Object Categorization, 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11] Christopher K. I. Williams, et al. Autoencoders and Probabilistic Inference with Missing Data: An Exact Solution for The Factor Analysis Case, 2018, ArXiv.

[12] Matthias Bethge, et al. A note on the evaluation of generative models, 2015, ICLR.

[13] Demis Hassabis, et al. SCAN: Learning Abstract Hierarchical Compositional Visual Concepts, 2017, ArXiv.

[14] Joshua B. Tenenbaum, et al. Bayesian Modeling of Human Concept Learning, 1998, NIPS.

[15] C. K. I. Williams, et al. Products of Gaussians and Probabilistic Minor Component Analysis, 2002.

[16] Sepp Hochreiter, et al. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), 2015, ICLR.

[17] Max Welling, et al. Improved Variational Inference with Inverse Autoregressive Flow, 2016, NIPS.

[18] Pieter Abbeel, et al. Variational Lossy Autoencoder, 2016, ICLR.

[19] Andrew McCallum, et al. Word Representations via Gaussian Embedding, 2014, ICLR.

[20] Leonidas J. Guibas, et al. Deep Knowledge Tracing, 2015, NIPS.

[21] Alan L. Yuille, et al. Joint Image-Text Representation by Gaussian Visual-Semantic Embedding, 2016, ACM Multimedia.

[22] Noam Chomsky. Aspects of the Theory of Syntax, 1965.

[23] Bogdan Raducanu, et al. Invertible Conditional GANs for image editing, 2016, ArXiv.

[24] Stefano Soatto, et al. Emergence of invariance and disentangling in deep representations, 2017.

[25] Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence, 2002, Neural Computation.

[26] Xinlei Chen, et al. Mind's eye: A recurrent visual representation for image caption generation, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Marco Baroni, et al. Grounding Distributional Semantics in the Visual World, 2016, Lang. Linguistics Compass.

[28] Stefano Soatto, et al. Visual Representations: Defining Properties and Deep Approximations, 2014, ICLR 2016.

[29] Li Fei-Fei, et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Bernt Schiele, et al. Zero-Shot Learning: The Good, the Bad and the Ugly, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Honglak Lee, et al. Deep Variational Canonical Correlation Analysis, 2016, ArXiv.

[32] Christopher Burgess, et al. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework, 2016, ICLR.

[33] Tom White, et al. Sampling Generative Networks: Notes on a Few Effective Techniques, 2016, ArXiv.

[34] Dhruv Batra, et al. C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset, 2017, ArXiv.

[35] Xiao Lin, et al. Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Martial Hebert, et al. From Red Wine to Red Tomato: Composition with Context, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37] Masahiro Suzuki, et al. Joint Multimodal Learning with Deep Generative Models, 2016, ICLR.

[38] Antonio Torralba, et al. Cross-Modal Scene Networks, 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39] Pieter Abbeel, et al. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets, 2016, NIPS.

[40] V. D. de Sa. Category learning through multimodality sensing, 1998, Neural Computation.

[41] Pascal Vincent, et al. Generalized Denoising Auto-Encoders as Generative Models, 2013, NIPS.

[42] Bernt Schiele, et al. Generative Adversarial Text to Image Synthesis, 2016, ICML.

[43] Sebastian Nowozin, et al. Multi-Level Variational Autoencoder: Learning Disentangled Representations from Grouped Observations, 2017, AAAI.

[44] Matthew D. Hoffman, et al. Learning Deep Latent Gaussian Models with Markov Chain Monte Carlo, 2017, ICML.

[45] Thomas L. Griffiths, et al. Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies, 2013, NIPS.

[46] Zhe Gan, et al. Variational Autoencoder for Deep Learning of Images, Labels and Captions, 2016, NIPS.

[47] Yoshua Bengio, et al. Generative Adversarial Networks, 2014, ArXiv.

[48] Jonathan Berant, et al. Learning to generalize to new compositions in image understanding, 2016, ArXiv.

[49] Ruslan Salakhutdinov, et al. On the quantitative analysis of deep belief networks, 2008, ICML.

[50] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.

[51] Andrew Gordon Wilson, et al. Multimodal Word Distributions, 2017, ACL.

[52] Soumith Chintala, et al. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, 2015, ICLR.

[53] Ambedkar Dukkipati, et al. Variational methods for conditional multimodal deep learning, 2016, 2017 International Joint Conference on Neural Networks (IJCNN).

[54] Alexei A. Efros, et al. Image-to-Image Translation with Conditional Adversarial Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55] Sanja Fidler, et al. Order-Embeddings of Images and Language, 2015, ICLR.

[56] Max Welling, et al. Auto-Encoding Variational Bayes, 2013, ICLR.

[57] Ruslan Salakhutdinov, et al. Generating Images from Captions with Attention, 2015, ICLR.

[58] Xiaogang Wang, et al. Deep Learning Face Attributes in the Wild, 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[59] Max Welling, et al. Semi-supervised Learning with Deep Generative Models, 2014, NIPS.

[60] Honglak Lee, et al. Learning Structured Output Representation using Deep Conditional Generative Models, 2015, NIPS.

[61] Peter Young, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, 2014, TACL.