Unsupervised Learning of Compositional Scene Representations from Multiple Unspecified Viewpoints

Abstract

Visual scenes are extremely rich in diversity, not only because there are infinite combinations of objects and background, but also because observations of the same scene may vary greatly as the viewpoint changes. When observing a visual scene that contains multiple objects from multiple viewpoints, humans are able to perceive the scene compositionally from each viewpoint, while achieving the so-called "object constancy" across different viewpoints, even though the exact viewpoints are unspecified. This ability is essential for humans to identify the same object while moving and to learn from vision efficiently. It is intriguing to design models with a similar ability. In this paper, we consider a novel problem of learning compositional scene representations from multiple unspecified viewpoints without using any supervision, and propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem. To infer latent representations, the information contained in different viewpoints is iteratively integrated by neural networks. Experiments on several specifically designed synthetic datasets show that the proposed method can learn effectively from multiple unspecified viewpoints.

Introduction

Vision is an important way for humans to acquire knowledge about the world. Because visual scenes consist of diverse combinations of objects and background, it is hard to model a whole scene directly. In the process of learning from the world, humans develop the concept of objects (Johnson 2010) and are thus capable of perceiving visual scenes compositionally, which in turn leads to more efficient learning than perceiving the entire scene as a whole (Fodor and Pylyshyn 1988). Compositionality is one of the fundamental ingredients for building artificial intelligence systems that learn as efficiently and effectively as humans (Lake et al. 2017). Therefore, instead of learning a single representation for the entire visual scene, it is desirable to build compositional scene representation models that learn object-centric representations (i.e., separate representations for different objects and the background), so that the combinatorial structure of visual scenes can be better captured.

[Figure: a scene observed from view 1, decomposed into objects (obj 1 to obj 4) and background (bck).]
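To make the latent split described above concrete, the following is a minimal PyTorch sketch, not the authors' architecture: the class and member names (ViewpointSplitModel, refine_ind, refine_dep) and all dimensions are hypothetical, the encoder and decoder are toy MLPs, and the iterative integration across viewpoints is reduced to GRU updates over pooled per-view features.

```python
# A toy sketch (assumed names, not the paper's implementation) of latents
# split into a viewpoint-independent part shared by all views and a
# viewpoint-dependent part held separately per view, with inference that
# iteratively integrates information from every view into the shared part.
import torch
import torch.nn as nn

class ViewpointSplitModel(nn.Module):
    def __init__(self, num_slots=5, dim_ind=32, dim_dep=8,
                 img_dim=64 * 64 * 3, hid=128):
        super().__init__()
        self.num_slots, self.dim_ind, self.dim_dep = num_slots, dim_ind, dim_dep
        # encoder maps one view to a feature vector used for refinement
        self.encoder = nn.Sequential(
            nn.Linear(img_dim, hid), nn.ReLU(), nn.Linear(hid, hid))
        # recurrent refiners iteratively update the two latent parts
        self.refine_ind = nn.GRUCell(hid, num_slots * dim_ind)
        self.refine_dep = nn.GRUCell(hid, num_slots * dim_dep)
        # decoder reconstructs one slot of one view from both latent parts
        self.decoder = nn.Sequential(
            nn.Linear(dim_ind + dim_dep, hid), nn.ReLU(), nn.Linear(hid, img_dim))

    def infer(self, views, num_iters=3):
        # views: (V, img_dim), V observations of the same scene
        V = views.shape[0]
        z_ind = views.new_zeros(1, self.num_slots * self.dim_ind)  # shared across views
        z_dep = views.new_zeros(V, self.num_slots * self.dim_dep)  # one per view
        feats = self.encoder(views)  # (V, hid)
        for _ in range(num_iters):
            # integrate evidence from all viewpoints into the shared part
            z_ind = self.refine_ind(feats.mean(dim=0, keepdim=True), z_ind)
            # update each view's own latents from that view alone
            z_dep = self.refine_dep(feats, z_dep)
        return z_ind, z_dep

    def decode(self, z_ind, z_dep):
        V = z_dep.shape[0]
        zi = z_ind.view(self.num_slots, self.dim_ind).expand(V, -1, -1)
        zd = z_dep.view(V, self.num_slots, self.dim_dep)
        x = self.decoder(torch.cat([zi, zd], dim=-1))  # (V, num_slots, img_dim)
        # a full model would mix slots with inferred masks; summing suffices here
        return x.sum(dim=1)

model = ViewpointSplitModel()
views = torch.rand(4, 64 * 64 * 3)    # four unspecified viewpoints of one scene
z_ind, z_dep = model.infer(views)
recons = model.decode(z_ind, z_dep)   # (4, 64*64*3)
```

In the actual model, inference is variational (cf. [1], [37]) and slot reconstructions are combined by a learned mixture rather than a plain sum; the sketch only illustrates where viewpoint-independent and viewpoint-dependent information live and how multiple views feed the shared part.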

[1] Max Welling et al. Auto-Encoding Variational Bayes, 2013, ICLR.

[2] Yisong Yue et al. Iterative Amortized Inference, 2018, ICML.

[3] Sungjin Ahn et al. Generative Neurosymbolic Machines, 2020, NeurIPS.

[4] Matthew Botvinick et al. MONet: Unsupervised Scene Decomposition and Representation, 2019, arXiv.

[5] Ben Poole et al. Categorical Reparameterization with Gumbel-Softmax, 2016, ICLR.

[6] James Bailey et al. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance, 2010, J. Mach. Learn. Res.

[7] Xiangyang Xue et al. Knowledge-Guided Object Discovery with Acquired Deep Impressions, 2021, AAAI.

[8] Andreas M. Lehrmann et al. PROVIDE: A Probabilistic Framework for Unsupervised Video Decomposition, 2021, UAI.

[9] Ingmar Posner et al. GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations, 2019, ICLR.

[10] Jürgen Schmidhuber et al. Neural Expectation Maximization, 2017, NIPS.

[11] David Barber et al. Tracking by Animation: Unsupervised Learning of Multi-Object Attentive Trackers, 2019, CVPR.

[12] Karol Gregor et al. Neural Variational Inference and Learning in Belief Networks, 2014, ICML.

[13] C. Lawrence Zitnick et al. Bringing Semantics into Focus Using Visual Abstraction, 2013, CVPR.

[14] Sanjay Ranka et al. Efficient Iterative Amortized Inference for Learning Symmetric and Disentangled Multi-Object Representations, 2021, ICML.

[15] Yee Whye Teh et al. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, 2016, ICLR.

[16] Jiajun Wu et al. Entity Abstraction in Visual Model-Based Reinforcement Learning, 2019, CoRL.

[17] Zenon W. Pylyshyn et al. Connectionism and Cognitive Architecture: A Critical Analysis, 1988, Cognition.

[18] Jürgen Schmidhuber et al. Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions, 2018, ICLR.

[19] Jürgen Schmidhuber et al. R-SQAIR: Relational Sequential Attend, Infer, Repeat, 2019, arXiv.

[20] Georg Heigold et al. Object-Centric Learning with Slot Attention, 2020, NeurIPS.

[21] Sungjin Ahn et al. SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition, 2020, ICLR.

[22] Joelle Pineau et al. Exploiting Spatial Invariance for Scalable Unsupervised Object Tracking, 2019, AAAI.

[23] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, 1982.

[24] Bin Li et al. Spatial Mixture Models with Learnable Deep Priors for Perceptual Grouping, 2019, AAAI.

[25] Koray Kavukcuoglu et al. Neural Scene Representation and Rendering, 2018, Science.

[26] Joshua B. Tenenbaum et al. Building Machines That Learn and Think Like People, 2016, Behavioral and Brain Sciences.

[27] Joelle Pineau et al. Spatially Invariant Unsupervised Object Detection with Convolutional Neural Networks, 2019, AAAI.

[28] Alexander S. Ecker et al. Benchmarking Unsupervised Object Representations for Video Sequences, 2020, J. Mach. Learn. Res.

[29] Kevin Murphy et al. Efficient Inference in Occlusion-Aware Generative Models of Images, 2015, arXiv.

[30] Xiaoyong Shen et al. Amodal Instance Segmentation with KINS Dataset, 2019, CVPR.

[31] Geoffrey E. Hinton et al. Attend, Infer, Repeat: Fast Scene Understanding with Generative Models, 2016, NIPS.

[32] Robert B. Fisher et al. Learning Object-Centric Representations of Multi-Object Scenes from Multiple Views, 2021, NeurIPS.

[33] Bin Li et al. Generative Modeling of Infinite Occluded Objects for Compositional Scene Representation, 2019, ICML.

[34] Scott P. Johnson. How Infants Learn About the Visual World, 2010, Cogn. Sci.

[35] R. A. McCarthy et al. The Neuropsychology of Object Constancy, 1997, Journal of the International Neuropsychological Society.

[36] Tim Salimans et al. Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression, 2012, arXiv.

[37] Klaus Greff et al. Multi-Object Representation Learning with Iterative Variational Inference, 2019, ICML.

[38] M. Cugmas et al. On Comparing Partitions, 2015.

[39] Li Fei-Fei et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, 2017, CVPR.

[40] Yee Whye Teh et al. Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects, 2018, NeurIPS.