Learned Equivariant Rendering without Transformation Supervision

We propose a self-supervised framework to learn scene representations from video that are automatically delineated into objects and background. Our method relies on moving objects being equivariant with respect to their transformations across frames, while the background remains constant. After training, we can manipulate and render the scenes in real time to create unseen combinations of objects, transformations, and backgrounds. We show results on Moving MNIST with backgrounds.
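The abstract does not spell out the architecture, so the following is only a minimal sketch of the idea as stated: a frame is encoded into separate object and background representations, the object's motion between two frames is inferred from the frames themselves (no transformation labels), the object features are warped by that inferred transformation in a spatial-transformer-like fashion while the background features are left untouched, and the next frame is reconstructed from the result. All module names, the affine parameterisation, and the hyperparameters below are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EquivariantRenderer(nn.Module):
    """Sketch: split a frame into object and background features, infer the
    object's transformation from a pair of frames, apply it to the object
    features only, and render the next frame."""

    def __init__(self, channels=3, feat=32):
        super().__init__()
        # Shared encoder producing a spatial feature map per frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, 2 * feat, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Split the feature map into object / background channels.
        self.obj_head = nn.Conv2d(2 * feat, feat, 1)
        self.bg_head = nn.Conv2d(2 * feat, feat, 1)
        # Predict a 2D affine transformation from the frame pair
        # (self-supervised: no transformation labels are used).
        self.pose = nn.Sequential(
            nn.Conv2d(2 * channels, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat, 6),
        )
        # Decoder mapping (transformed object + background) back to pixels.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * feat, feat, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat, channels, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        # Initialise the pose head near the identity transformation.
        self.pose[-1].weight.data.zero_()
        self.pose[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, frame_t, frame_t1):
        h = self.encoder(frame_t)
        obj, bg = self.obj_head(h), self.bg_head(h)
        # Infer the object's motion between the two frames.
        theta = self.pose(torch.cat([frame_t, frame_t1], dim=1)).view(-1, 2, 3)
        # Equivariance: apply the inferred transformation in feature space.
        grid = F.affine_grid(theta, obj.shape, align_corners=False)
        obj_t1 = F.grid_sample(obj, grid, align_corners=False)
        # Render from the transformed object and the unchanged background.
        return self.decoder(torch.cat([obj_t1, bg], dim=1))


# Self-supervised training step: reconstruct frame t+1 from frame t.
model = EquivariantRenderer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
frame_t = torch.rand(8, 3, 64, 64)   # stand-ins for consecutive video frames
frame_t1 = torch.rand(8, 3, 64, 64)
loss = F.mse_loss(model(frame_t, frame_t1), frame_t1)
opt.zero_grad()
loss.backward()
opt.step()
```

Because the background features are never warped, the same `bg` can be recombined at test time with object features taken from other scenes, or with novel transformations, which is what enables rendering unseen object/transformation/background combinations.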
