MarioNette: Self-Supervised Sprite Learning

Visual content often contains recurring elements: text is made up of glyphs from the same font; animations, such as cartoons or video games, are composed of sprites moving around the screen; and natural videos frequently show repeated views of the same objects. In this paper, we propose a deep learning approach for obtaining a graphically disentangled representation of recurring elements in a completely self-supervised manner. By jointly learning a dictionary of texture patches and training a network that places them onto a canvas, we effectively deconstruct sprite-based content into a sparse, consistent, and interpretable representation that can be easily used in downstream tasks. Our framework offers a promising approach for discovering recurring patterns in image collections without supervision.
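The core reconstruction step described above, placing patches from a learned dictionary onto a canvas, boils down to alpha compositing. The following is a minimal NumPy sketch of that compositing step only; the function name, the `(k, y, x)` placement interface, and integer-pixel placement are illustrative assumptions (the actual model predicts soft, differentiable placements from a network), not the paper's implementation.

```python
import numpy as np

def composite_sprites(canvas, dictionary, placements):
    """Alpha-composite RGBA sprites from a dictionary onto an RGB canvas.

    canvas:     (H, W, 3) float RGB background.
    dictionary: (K, h, w, 4) float RGBA patches (the sprite dictionary).
    placements: list of (k, y, x) tuples; here a hypothetical hard-placement
                stand-in for the paper's learned placement network.
    """
    out = canvas.copy()
    for k, y, x in placements:
        patch = dictionary[k]
        h, w = patch.shape[:2]
        rgb, alpha = patch[..., :3], patch[..., 3:]
        region = out[y:y + h, x:x + w]
        # Porter-Duff "over" operator: sprite in front of current canvas.
        out[y:y + h, x:x + w] = alpha * rgb + (1.0 - alpha) * region
    return out
```

In the learned setting, positions and dictionary entries are continuous and optimized end to end by comparing the composited canvas against the input frame with a reconstruction loss.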
