LAVAE: Disentangling Location and Appearance

We propose a probabilistic generative model for unsupervised learning of structured, interpretable, object-based representations of visual scenes. We use amortized variational inference to train the generative model end-to-end. The learned representations of object location and appearance are fully disentangled, and objects are represented independently of each other in the latent space. Unlike previous approaches that disentangle location and appearance, ours generalizes seamlessly to scenes with many more objects than encountered in the training regime. We evaluate the proposed model on multi-MNIST and multi-dSprites data sets.

[1]  Jürgen Schmidhuber,et al.  Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions , 2018, ICLR.

[2]  Geoffrey E. Hinton,et al.  Attend, Infer, Repeat: Fast Scene Understanding with Generative Models , 2016, NIPS.

[3]  Christopher P. Burgess,et al.  The Multi-Entity Variational Autoencoder , 2018 .

[4]  Matthew Botvinick,et al.  MONet: Unsupervised Scene Decomposition and Representation , 2019, ArXiv.

[5]  Jürgen Schmidhuber,et al.  Neural Expectation Maximization , 2017, NIPS.

[6]  Albert Yonas,et al.  Perceptual development in infancy , 2013 .

[7]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[8]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[9]  Yoshua Bengio,et al.  The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[10]  A. Yuille,et al.  Opinion TRENDS in Cognitive Sciences Vol.10 No.7 July 2006 Special Issue: Probabilistic models of cognition Vision as Bayesian inference: analysis by synthesis? , 2022 .

[11]  Koray Kavukcuoglu,et al.  Neural scene representation and rendering , 2018, Science.

[12]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Sergio Guadarrama,et al.  Tracking Emerges by Colorizing Videos , 2018, ECCV.

[14]  Kaiming He,et al.  Mask R-CNN , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15]  Jiajun Wu,et al.  Neural Scene De-rendering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[17]  Tomaso Poggio,et al.  Computational vision and regularization theory , 1985, Nature.

[18]  E. Spelke Where Perceiving Ends and Thinking Begins: The Apprehension of Objects in Infancy , 2013 .

[19]  Murray Shanahan,et al.  Reconciling deep learning with symbolic artificial intelligence: representing objects and relations , 2019, Current Opinion in Behavioral Sciences.

[20]  Yee Whye Teh,et al.  The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables , 2016, ICLR.

[21]  Elizabeth S. Spelke,et al.  Principles of Object Perception , 1990, Cogn. Sci..

[22]  Tomaso Poggio,et al.  Computational vision and regularization theory , 1985, Nature.

[23]  Jiajun Wu,et al.  Learning to Infer and Execute 3D Shape Programs , 2019, ICLR.

[24]  Ben Poole,et al.  Categorical Reparameterization with Gumbel-Softmax , 2016, ICLR.

[25]  Pieter Abbeel,et al.  InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets , 2016, NIPS.

[26]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Katherine D. Kinzler,et al.  Core knowledge. , 2007, Developmental science.

[28]  Klaus Greff,et al.  Multi-Object Representation Learning with Iterative Variational Inference , 2019, ICML.

[29]  Andriy Mnih,et al.  Disentangling by Factorising , 2018, ICML.

[30]  Christopher Burgess,et al.  beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework , 2016, ICLR 2016.

[31]  Joshua B. Tenenbaum,et al.  Deep Convolutional Inverse Graphics Network , 2015, NIPS.

[32]  Shu Liu,et al.  Path Aggregation Network for Instance Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Bernhard Schölkopf,et al.  Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations , 2018, ICML.

[34]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).