InSeGAN: A Generative Approach to Segmenting Identical Instances in Depth Images

In this paper, we present InSeGAN, an unsupervised 3D generative adversarial network (GAN) for segmenting (nearly) identical instances of rigid objects in depth images. Using an analysis-by-synthesis approach, we design a novel GAN architecture to synthesize a multiple-instance depth image with independent control over each instance. InSeGAN takes in a set of code vectors (e.g., random noise vectors), each encoding the 3D pose of an object that is represented by a learned implicit object template. The generator has two distinct modules. The first module, the instance feature generator, uses each encoded pose to transform the implicit template into a feature map representation of each object instance. The second module, the depth image renderer, aggregates all of the single-instance feature maps output by the first module and generates a multiple-instance depth image. A discriminator distinguishes the generated multiple-instance depth images from the distribution of true depth images. To use our model for instance segmentation, we propose an instance pose encoder that learns to take in a generated depth image and reproduce the pose code vectors for all of the object instances. To evaluate our approach, we introduce a new synthetic dataset, “Insta-10,” consisting of 100,000 depth images, each with 5 instances of an object from one of 10 classes. Our experiments on Insta-10, as well as on real-world noisy depth images, show that InSeGAN achieves state-of-the-art performance, often outperforming prior methods by large margins.
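The data flow of the generator described above can be illustrated with a toy, non-learned sketch. It is not the InSeGAN implementation: the learned implicit template is replaced by a small fixed depth patch, the encoded pose by a hypothetical three-number code (two placement values and a depth offset), and the learned renderer by a per-pixel minimum over instances. All function names, the code layout, and the use of infinite depth for background are assumptions made for illustration only.

```python
import random

BACKGROUND = float("inf")  # assumption: "no surface" is encoded as infinite depth


def sample_pose_codes(n_instances, dim=3, seed=0):
    """Draw one random code vector per instance (stand-in for the noise vectors)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_instances)]


def instance_feature_map(template, code, size):
    """Toy 'instance feature generator': place the shared template on a size x size grid.

    Hypothetical decoding: code[0] and code[1] in [-1, 1] pick the top-left corner,
    code[2] shifts the depth of the whole patch.
    """
    th, tw = len(template), len(template[0])
    row = round((code[0] + 1) / 2 * (size - th))
    col = round((code[1] + 1) / 2 * (size - tw))
    fmap = [[BACKGROUND] * size for _ in range(size)]
    for r in range(th):
        for c in range(tw):
            fmap[row + r][col + c] = template[r][c] + code[2]
    return fmap


def render_depth(feature_maps):
    """Toy 'depth image renderer': per-pixel minimum over instances (nearest surface wins)."""
    size = len(feature_maps[0])
    return [[min(f[r][c] for f in feature_maps) for c in range(size)] for r in range(size)]


def generate(template, codes, size=8):
    """Run both toy modules: one single-instance map per code, then one merged image."""
    maps = [instance_feature_map(template, z, size) for z in codes]
    return render_depth(maps), maps


if __name__ == "__main__":
    template = [[5.0, 5.0], [5.0, 5.0]]  # hypothetical 2 x 2 object patch
    codes = sample_pose_codes(5)         # 5 instances, as in each Insta-10 image
    depth, maps = generate(template, codes)
    print(len(maps), "single-instance maps merged into a",
          len(depth), "x", len(depth[0]), "depth image")
```

Because each single-instance map depends on exactly one code vector, changing one code moves one object and leaves the other maps untouched, which is the kind of independent per-instance control the abstract refers to.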
