Generative Models of Visually Grounded Imagination

It is easy for people to imagine what a man with pink hair looks like, even if they have never seen such a person before. We call the ability to create images of novel semantic concepts visually grounded imagination. In this paper, we show how to modify variational auto-encoders to perform this task. Our method uses a novel training objective and a novel product-of-experts inference network, which can handle partially specified (abstract) concepts in a principled and efficient way. We also propose a set of easy-to-compute evaluation metrics that capture our intuitive notions of what it means to have good visual imagination, namely correctness, coverage, and compositionality (the 3 C's). Finally, we perform a detailed comparison of our method with two existing joint image-attribute VAE methods (the JMVAE method of Suzuki et al. and the BiVCCA method of Wang et al.) by applying them to two datasets: the MNIST-with-attributes dataset (which we introduce here), and the CelebA dataset.
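To make the product-of-experts idea concrete, the sketch below multiplies one-dimensional Gaussian experts, one per specified attribute, together with a prior. It is an illustration only and not the paper's implementation: the function name `product_of_gaussian_experts`, the `(mean, variance)` tuples, the standard-normal prior, and the use of `None` to mark an unspecified attribute are all assumptions made for this example. It relies on the standard fact that a product of Gaussian densities is Gaussian, with precision equal to the sum of the experts' precisions and a precision-weighted mean.

```python
from typing import Optional, Sequence, Tuple

Gaussian = Tuple[float, float]  # (mean, variance) of a 1-D Gaussian expert


def product_of_gaussian_experts(
    experts: Sequence[Optional[Gaussian]],
    prior: Gaussian = (0.0, 1.0),  # assumed standard-normal prior
) -> Gaussian:
    """Combine 1-D Gaussian experts by multiplying their densities.

    Entries that are None stand for unspecified attributes and are skipped,
    so a fully unspecified concept falls back to the prior.
    """
    prior_mean, prior_var = prior
    precision = 1.0 / prior_var            # precisions add under a product
    weighted_mean = prior_mean / prior_var  # running sum of mean * precision
    for expert in experts:
        if expert is None:
            continue  # attribute not specified: contributes no expert
        mean, var = expert
        precision += 1.0 / var
        weighted_mean += mean / var
    return weighted_mean / precision, 1.0 / precision


# Hypothetical partially specified concept: the second attribute is missing.
mean, var = product_of_gaussian_experts([(2.0, 1.0), None, (4.0, 1.0)])
```

In this sketch, leaving an attribute unspecified removes its expert from the product, so the combined distribution is broader than when every attribute is given; this is one way a partially specified concept can be handled without training a separate network for each subset of attributes.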

[1]  Harlene Hayne,et al.  Pigeons on Par with Primates in Numerical Competence , 2011, Science.

[2]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[3]  Wei-Lun Chao,et al.  An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild , 2016, ECCV.

[4]  Subhransu Maji,et al.  A Taxonomy of Part and Attribute Discovery Techniques , 2017 .

[5]  J. Tenenbaum.  A Bayesian framework for concept learning , 1999 .

[6]  Timothy M. Hospedales,et al.  Gaussian Visual-Linguistic Embedding for Zero-Shot Recognition , 2016, EMNLP.

[7]  Yann LeCun,et al.  Disentangling factors of variation in deep representation using adversarial training , 2016, NIPS.

[8]  Honglak Lee,et al.  Attribute2Image: Conditional Image Generation from Visual Attributes , 2015, ECCV.

[9]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[10]  Christoph H. Lampert,et al.  Attribute-Based Classification for Zero-Shot Visual Object Categorization , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Christopher K. I. Williams,et al.  Autoencoders and Probabilistic Inference with Missing Data: An Exact Solution for The Factor Analysis Case , 2018, ArXiv.

[12]  Matthias Bethge,et al.  A note on the evaluation of generative models , 2015, ICLR.

[13]  Demis Hassabis,et al.  SCAN: Learning Abstract Hierarchical Compositional Visual Concepts , 2017, ArXiv.

[14]  Joshua B. Tenenbaum,et al.  Bayesian Modeling of Human Concept Learning , 1998, NIPS.

[15]  Christopher K. I. Williams,et al.  Products of Gaussians and Probabilistic Minor Component Analysis , 2002 .

[16]  Sepp Hochreiter,et al.  Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs) , 2015, ICLR.

[17]  Max Welling,et al.  Improved Variational Inference with Inverse Autoregressive Flow , 2016, NIPS.

[18]  Pieter Abbeel,et al.  Variational Lossy Autoencoder , 2016, ICLR.

[19]  Andrew McCallum,et al.  Word Representations via Gaussian Embedding , 2014, ICLR.

[20]  Leonidas J. Guibas,et al.  Deep Knowledge Tracing , 2015, NIPS.

[21]  Alan L. Yuille,et al.  Joint Image-Text Representation by Gaussian Visual-Semantic Embedding , 2016, ACM Multimedia.

[22]  Noam Chomsky,et al.  Aspects of the Theory of Syntax , 1965 .

[23]  Bogdan Raducanu,et al.  Invertible Conditional GANs for image editing , 2016, ArXiv.

[24]  Stefano Soatto,et al.  Emergence of invariance and disentangling in deep representations , 2017 .

[25]  Geoffrey E. Hinton.  Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.

[26]  Xinlei Chen,et al.  Mind's eye: A recurrent visual representation for image caption generation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Marco Baroni,et al.  Grounding Distributional Semantics in the Visual World , 2016, Lang. Linguistics Compass.

[28]  Stefano Soatto,et al.  Visual Representations: Defining Properties and Deep Approximations , 2014, ICLR 2016.

[29]  Li Fei-Fei,et al.  CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Bernt Schiele,et al.  Zero-Shot Learning — The Good, the Bad and the Ugly , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Honglak Lee,et al.  Deep Variational Canonical Correlation Analysis , 2016, ArXiv.

[32]  Christopher Burgess,et al.  beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework , 2016, ICLR 2016.

[33]  Tom White,et al.  Sampling Generative Networks: Notes on a Few Effective Techniques , 2016, ArXiv.

[34]  Dhruv Batra,et al.  C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset , 2017, ArXiv.

[35]  Xiao Lin,et al.  Don't just listen, use your imagination: Leveraging visual common sense for non-visual tasks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Martial Hebert,et al.  From Red Wine to Red Tomato: Composition with Context , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Masahiro Suzuki,et al.  Joint Multimodal Learning with Deep Generative Models , 2016, ICLR.

[38]  Antonio Torralba,et al.  Cross-Modal Scene Networks , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Pieter Abbeel,et al.  InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets , 2016, NIPS.

[40]  V. D. de Sa.  Category learning through multimodality sensing , 1998, Neural Computation.

[41]  Pascal Vincent,et al.  Generalized Denoising Auto-Encoders as Generative Models , 2013, NIPS.

[42]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[43]  Sebastian Nowozin,et al.  Multi-Level Variational Autoencoder: Learning Disentangled Representations from Grouped Observations , 2017, AAAI.

[44]  Matthew D. Hoffman,et al.  Learning Deep Latent Gaussian Models with Markov Chain Monte Carlo , 2017, ICML.

[45]  Thomas L. Griffiths,et al.  Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies , 2013, NIPS.

[46]  Zhe Gan,et al.  Variational Autoencoder for Deep Learning of Images, Labels and Captions , 2016, NIPS.

[47]  Yoshua Bengio,et al.  Generative Adversarial Networks , 2014, ArXiv.

[48]  Jonathan Berant,et al.  Learning to generalize to new compositions in image understanding , 2016, ArXiv.

[49]  Ruslan Salakhutdinov,et al.  On the quantitative analysis of deep belief networks , 2008, ICML '08.

[50]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[51]  Andrew Gordon Wilson,et al.  Multimodal Word Distributions , 2017, ACL.

[52]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[53]  Ambedkar Dukkipati,et al.  Variational methods for conditional multimodal deep learning , 2016, 2017 International Joint Conference on Neural Networks (IJCNN).

[54]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Sanja Fidler,et al.  Order-Embeddings of Images and Language , 2015, ICLR.

[56]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[57]  Ruslan Salakhutdinov,et al.  Generating Images from Captions with Attention , 2015, ICLR.

[58]  Xiaogang Wang,et al.  Deep Learning Face Attributes in the Wild , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[59]  Max Welling,et al.  Semi-supervised Learning with Deep Generative Models , 2014, NIPS.

[60]  Honglak Lee,et al.  Learning Structured Output Representation using Deep Conditional Generative Models , 2015, NIPS.

[61]  Peter Young,et al.  From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.