Unsupervised Learning of Compositional Energy Concepts

Humans are able to rapidly understand scenes by utilizing concepts extracted from prior experience. Such concepts are diverse, and include global scene descriptors, such as the weather or lighting, as well as local scene descriptors, such as the color or size of a particular object. So far, unsupervised discovery of concepts has focused on modeling either the global scene-level or the local object-level factors of variation, but not both. In this work, we propose COMET, which discovers and represents concepts as separate energy functions, enabling us to represent both global concepts and object-level concepts under a unified framework. COMET discovers energy functions by recomposing the input image, which we find captures independent factors of variation without additional supervision. Sample generation in COMET is formulated as an optimization process on the underlying energy functions, enabling us to generate images with permuted and composed concepts. Finally, the visual concepts discovered by COMET generalize well, enabling us to compose concepts across separate image modalities as well as with concepts discovered by a separate instance of COMET trained on a different dataset.
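The generation procedure described above (composing concepts by summing their energy functions and minimizing the sum) can be illustrated with a toy sketch. This is a hypothetical illustration, not the paper's implementation: the learned neural energy functions are stood in for by simple quadratic energies, and gradient descent plays the role of the optimization process.

```python
import numpy as np

def make_quadratic_energy(target):
    """Stand-in for a learned concept energy: low energy near `target`."""
    def energy(x):
        return 0.5 * np.sum((x - target) ** 2)
    def grad(x):
        return x - target
    return energy, grad

def generate(concepts, x0, step=0.1, n_steps=200):
    """Generate a sample by gradient descent on the summed concept energies."""
    x = x0.copy()
    for _ in range(n_steps):
        g = sum(grad(x) for _, grad in concepts)  # gradient of the composed energy
        x = x - step * g
    return x

# Two "concepts": one pulls the sample toward a, the other toward b.
# Their composition settles at the minimizer of the summed energy
# (the midpoint, for these quadratics).
a, b = np.zeros(4), np.ones(4)
concepts = [make_quadratic_energy(a), make_quadratic_energy(b)]
x = generate(concepts, x0=np.full(4, 5.0))
```

Permuting or swapping concepts amounts to changing which energy functions enter the sum before optimizing, which is why composition across modalities or across separately trained models is possible in this framework.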
