Emergence of Object Segmentation in Perturbed Generative Models

We introduce a novel framework to build a model that can learn how to segment objects from a collection of images without any human annotation. Our method builds on the observation that the location of object segments can be perturbed locally relative to a given background without affecting the realism of a scene. Our approach is to first train a generative model of a layered scene. The layered representation consists of a background image, a foreground image and the mask of the foreground. A composite image is then obtained by overlaying the masked foreground image onto the background. The generative model is trained in an adversarial fashion against a discriminator, which forces the generative model to produce realistic composite images. To force the generator to learn a representation where the foreground layer corresponds to an object, we perturb the output of the generative model by introducing a random shift of both the foreground image and mask relative to the background. Because the generator is unaware of the shift before computing its output, it must produce layered representations that are realistic for any such random perturbation. Finally, we learn to segment an image by defining an autoencoder consisting of an encoder, which we train, and the pre-trained generator as the decoder, which we freeze. The encoder maps an image to a feature vector, which is fed as input to the generator to give a composite image matching the original input image. Because the generator outputs an explicit layered representation of the scene, the encoder learns to detect and segment objects. We demonstrate this framework on real images of several object categories.
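The compositing and perturbation steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`composite`, `shift_layer`, `perturbed_composite`), the integer-pixel shift, the `max_shift` parameter, and the zero-filling of exposed borders are all assumptions made for illustration. The sketch assumes float images of shape (H, W, C) and a mask of shape (H, W, 1) with values in [0, 1].

```python
import numpy as np


def composite(background, foreground, mask):
    """Overlay the masked foreground onto the background (alpha blend)."""
    return mask * foreground + (1.0 - mask) * background


def shift_layer(layer, dy, dx):
    """Translate an (H, W, C) array by (dy, dx) pixels.

    Pixels exposed by the shift are zero-filled (an assumption of this sketch).
    """
    h, w = layer.shape[:2]
    out = np.zeros_like(layer)
    # Source and destination windows that remain inside the image after the shift.
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = layer[src_y, src_x]
    return out


def perturbed_composite(background, foreground, mask, max_shift, rng):
    """Shift foreground and mask by the same random offset, then composite.

    The offset is drawn after the layers are produced, mirroring the idea that
    the generator cannot anticipate the shift.
    """
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    shifted_fg = shift_layer(foreground, dy, dx)
    shifted_mask = shift_layer(mask, dy, dx)
    return composite(background, shifted_fg, shifted_mask)
```

In an adversarial training loop, the output of `perturbed_composite` (rather than the unshifted composite) would be what the discriminator sees, so a foreground layer that only looks realistic at one fixed position would be penalized. Because the mask is zero in the exposed border, the background shows through there.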
