Learning to Compose Visual Relations

The visual world around us can be described as a structured set of objects and the relations between them. An image of a room can be conjured given only a description of its underlying objects and their associated relations. While there has been significant work on designing deep neural networks that compose individual objects together, far less work has addressed composing the individual relations between objects. A principal difficulty is that while objects can be placed independently of one another, their relations are entangled and depend on each other. To circumvent this issue, existing works primarily compose relations using a holistic encoder over a textual or graph description of the scene. In this work, we instead propose to represent each relation as an unnormalized density (an energy-based model), enabling us to compose separate relations in a factorized manner. We show that such a factorized decomposition allows the model to both generate and edit scenes with multiple sets of relations more faithfully. We further show that the decomposition enables our model to effectively understand the underlying relational scene structure. Project page: https://composevisualrelations.github.io/
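
To make the factorized composition concrete, the snippet below is a minimal sketch of the idea: each relation contributes an unnormalized density, the composed energy is the sum of per-relation energies, and an image consistent with all relations is drawn by Langevin dynamics on that sum. This is an assumption-laden illustration in PyTorch style; the network architecture, the name `RelationEBM`, and all hyperparameters (step count, step size, noise scale, image resolution) are hypothetical placeholders, not the authors' released implementation.

```python
import torch

class RelationEBM(torch.nn.Module):
    """Illustrative energy network E(x, r): an image plus one relation embedding -> scalar energy."""
    def __init__(self, rel_dim=64):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(3, 32, 3, stride=2, padding=1), torch.nn.SiLU(),
            torch.nn.Conv2d(32, 64, 3, stride=2, padding=1), torch.nn.SiLU(),
            torch.nn.AdaptiveAvgPool2d(1),
        )
        self.head = torch.nn.Linear(64 + rel_dim, 1)

    def forward(self, x, rel):
        h = self.conv(x).flatten(1)
        return self.head(torch.cat([h, rel], dim=1)).squeeze(1)

def composed_energy(ebm, x, relations):
    # Factorized composition: the total energy is the sum of per-relation energies,
    # i.e. the product of the corresponding unnormalized densities.
    return sum(ebm(x, r) for r in relations)

@torch.enable_grad()
def langevin_sample(ebm, relations, shape=(1, 3, 64, 64), steps=60, step_size=10.0, noise=0.005):
    # Sample an image consistent with all relations by noisy gradient descent
    # on the composed energy (Langevin dynamics).
    x = torch.rand(shape, requires_grad=True)
    for _ in range(steps):
        energy = composed_energy(ebm, x, relations).sum()
        grad, = torch.autograd.grad(energy, x)
        x = (x - step_size * grad + noise * torch.randn_like(x)).clamp(0, 1)
        x = x.detach().requires_grad_(True)
    return x.detach()
```

Summing energies amounts to taking a product of the per-relation densities, so additional relations can be conjoined at sampling time simply by appending their embeddings to `relations` (e.g. embeddings of statements such as "the cube is left of the sphere"), without retraining a monolithic text or graph encoder.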
