Unmasking the Inductive Biases of Unsupervised Object Representations for Video Sequences

Perceiving the world in terms of objects is a crucial prerequisite for reasoning and scene understanding. Recently, several methods have been proposed for unsupervised learning of object-centric representations. However, since these models have been evaluated on different downstream tasks, it remains unclear how they compare in terms of basic perceptual abilities such as detection, figure-ground segmentation, and tracking of individual objects. In this paper, we argue that the established evaluation protocol of multi-object tracking tests precisely these perceptual qualities, and we propose a new benchmark dataset based on procedurally generated video sequences. Using this benchmark, we compare the perceptual abilities of three state-of-the-art unsupervised object-centric learning approaches. To this end, we propose a video extension of MONet, a seminal object-centric model for static scenes, and compare it to two recent video models: OP3, which exploits clustering via spatial mixture models, and TBA, which uses an explicit factorization via spatial transformers. Our results indicate that architectures employing unconstrained latent representations based on per-object variational autoencoders and full-image object masks learn more powerful representations in terms of object detection, segmentation, and tracking than the explicitly parameterized, spatial-transformer-based architecture. We also observe that none of the methods handles the most challenging tracking scenarios gracefully, suggesting that our synthetic video benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
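The two model families compared above differ most clearly in how they compose a scene. As a point of reference, here is a minimal sketch (in PyTorch, not the authors' code) of the spatial-mixture decoding shared by MONet and OP3: each of K slots decodes a full-image RGB appearance and a mask logit, and the masks compete through a pixel-wise softmax so that they sum to one at every pixel.

    import torch

    def compose(rgb, mask_logits):
        """Spatial-mixture composition over K object slots.

        rgb:         (K, 3, H, W) per-slot full-image appearances.
        mask_logits: (K, 1, H, W) per-slot mask logits.
        """
        masks = torch.softmax(mask_logits, dim=0)  # normalize across slots, per pixel
        return (masks * rgb).sum(dim=0)            # (3, H, W) reconstructed frame

The evaluation protocol referred to above is the CLEAR MOT family of metrics [26]. The headline MOTA score accumulates misses (false negatives), false positives, and identity switches over all frames, relative to the total number of ground-truth objects. The sketch below assumes these per-frame counts come from an external matching step (e.g., Hungarian matching of predicted to ground-truth objects), which is not shown.

    def mota(frames):
        """frames: iterable of (num_gt, false_neg, false_pos, id_switches) per frame."""
        total_gt = sum(f[0] for f in frames)
        total_err = sum(f[1] + f[2] + f[3] for f in frames)
        # MOTA = 1 - (FN + FP + IDSW) / GT, accumulated over all frames.
        return 1.0 - total_err / total_gt

    # Example: 3 frames, 5 ground-truth objects each, with a few errors.
    print(mota([(5, 1, 0, 0), (5, 0, 1, 1), (5, 0, 0, 0)]))  # 0.8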

[1] Alexander Lerchner, et al. COBRA: Data-Efficient Model-Based RL through Unsupervised Object Discovery and Curiosity-Driven Exploration, 2019, arXiv.

[2] Bin Li, et al. Generative Modeling of Infinite Occluded Objects for Compositional Scene Representation, 2019, ICML.

[3] Andrew Zisserman, et al. Spatial Transformer Networks, 2015, NIPS.

[4] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[5] Jürgen Schmidhuber, et al. Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions, 2018, ICLR.

[6] Stefan Roth, et al. MOT16: A Benchmark for Multi-Object Tracking, 2016, arXiv.

[7] Alexander Lerchner, et al. Spatial Broadcast Decoder: A Simple Architecture for Learning Disentangled Representations in VAEs, 2019, arXiv.

[8] Sungjin Ahn, et al. SPACE: Unsupervised Object-Oriented Scene Representation via Spatial Attention and Decomposition, 2020, ICLR.

[9] Ingmar Posner, et al. GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations, 2019, ICLR.

[10] Jürgen Schmidhuber, et al. Neural Expectation Maximization, 2017, NIPS.

[11] Yee Whye Teh, et al. Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects, 2018, NeurIPS.

[12] Alex Graves, et al. Recurrent Models of Visual Attention, 2014, NIPS.

[13] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[14] Matthew Botvinick, et al. MONet: Unsupervised Scene Decomposition and Representation, 2019, arXiv.

[15] Thomas Brox, et al. U-Net: Convolutional Networks for Biomedical Image Segmentation, 2015, MICCAI.

[16] Luca Iocchi, et al. Independent Multimodal Background Subtraction, 2012, CompIMAGE.

[17] Geoffrey E. Hinton, et al. Attend, Infer, Repeat: Fast Scene Understanding with Generative Models, 2016, NIPS.

[18] Harri Valpola, et al. Tagger: Deep Unsupervised Perceptual Grouping, 2016, NIPS.

[19] Joelle Pineau, et al. Spatially Invariant Unsupervised Object Detection with Convolutional Neural Networks, 2019, AAAI.

[20] Eero P. Simoncelli, et al. Perceptual Straightening of Natural Videos, 2019, Nature Neuroscience.

[21] Peter V. Gehler, et al. Towards Causal Generative Scene Models via Competition of Experts, 2020, arXiv.

[22] Klaus Greff, et al. Multi-Object Representation Learning with Iterative Variational Inference, 2019, ICML.

[23] David Barber, et al. Tracking by Animation: Unsupervised Learning of Multi-Object Attentive Trackers, 2019, CVPR.

[24] Jiajun Wu, et al. Entity Abstraction in Visual Model-Based Reinforcement Learning, 2019, CoRL.

[25] Yisong Yue, et al. Iterative Amortized Inference, 2018, ICML.

[26] Rainer Stiefelhagen, et al. Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics, 2008, EURASIP Journal on Image and Video Processing.

[27] Andreas Geiger, et al. MOTS: Multi-Object Tracking and Segmentation, 2019, CVPR.

[28] Ben Poole, et al. Categorical Reparameterization with Gumbel-Softmax, 2016, ICLR.

[29] Max Welling, et al. Auto-Encoding Variational Bayes, 2013, ICLR.

[30] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.

[31] Yoshua Bengio, et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation, 2014, EMNLP.