VIM: Variational Independent Modules for Video Prediction

We introduce Variational Independent Modules (VIM), a variational inference model for sequential data that learns and infers latent representations as a set of objects and discovers modular causal mechanisms over these objects. These mechanisms, which we call modules, are independently parametrized, define the stochastic transitions of entities, and are shared across entities. At each time step, our model infers from a low-level input sequence a high-level sequence of categorical latent variables that select which transition module is applied to which high-level object. We evaluate the model on video prediction tasks, where the goal is to predict multi-modal future events given previous observations. We demonstrate empirically that VIM can model 2D visual sequences in an interpretable way and can identify the underlying, dynamically instantiated mechanisms of the generation process. We additionally show that the learnt modules can be composed at test time to generalize to out-of-distribution observations.
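The selection step described above can be illustrated with a toy sketch. This is not the authors' implementation: the additive Gaussian modules, the `step` and `softmax` helpers, and the hand-supplied `logits` are all assumptions made for illustration. In VIM the selection variables are inferred variationally and the modules are learned networks, neither of which is modelled here. The sketch only shows a per-object categorical variable choosing one module from a set that every object shares.

```python
import math
import random


def softmax(scores):
    """Turn raw selection scores into categorical probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def step(objects, logits, modules, rng):
    """Advance every object one time step.

    objects: list of state vectors, one per object.
    logits:  one list of selection scores per object (hypothetical input;
             in the paper the selection is inferred, not given).
    modules: list of (shift, sigma) pairs, a stand-in for learned stochastic
             transition modules shared by all objects.
    """
    next_objects, choices = [], []
    for state, scores in zip(objects, logits):
        # Sample which shared module governs this object at this step.
        k = rng.choices(range(len(modules)), weights=softmax(scores))[0]
        shift, sigma = modules[k]
        # Apply the chosen module: deterministic shift plus Gaussian noise.
        next_objects.append(
            [x + d + rng.gauss(0.0, sigma) for x, d in zip(state, shift)]
        )
        choices.append(k)
    return next_objects, choices


if __name__ == "__main__":
    rng = random.Random(0)
    modules = [([1.0, 0.0], 0.1), ([0.0, -1.0], 0.1)]  # "move right", "move down"
    objects = [[0.0, 0.0], [5.0, 5.0]]
    logits = [[2.0, 0.0], [0.0, 2.0]]
    print(step(objects, logits, modules, rng))
```

Because the modules are indexed independently of the objects, the same module can be reused for any object, which is the property the abstract relies on when composing modules at test time.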
