Diffusion Probabilistic Models for Scene-Scale 3D Categorical Data

In this paper, we learn a diffusion model that generates 3D data at scene scale. Specifically, our model generates a 3D scene consisting of multiple objects, whereas recent diffusion research has focused on single objects. To this end, we represent a scene with discrete class labels, i.e., a categorical distribution, so that multiple objects are assigned to semantic categories. We therefore extend discrete diffusion models to learn scene-scale categorical distributions. In addition, we validate that a latent diffusion model can reduce the computational cost of training and deployment. To the best of our knowledge, our work is the first to apply discrete and latent diffusion to 3D categorical data at scene scale. We further propose to perform semantic scene completion (SSC) by learning a conditional distribution with our diffusion model, where the condition is a partial observation in the form of a sparse point cloud. In experiments, we show empirically that our diffusion models not only generate reasonable scenes but also perform scene completion better than a discriminative model. Our code and models are available at https://github.com/zoomin-lee/scene-scale-diffusion
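
To make the idea of diffusion over categorical scene labels concrete, the following is a minimal sketch of a forward (noising) step on a voxel grid of class labels. It is not taken from the released code. It assumes a uniform-transition kernel, q(x_t | x_0) = Cat(ᾱ_t · onehot(x_0) + (1 − ᾱ_t) / K), which is one common choice in discrete diffusion; the function name `forward_noise`, the grid size, and the class count are all hypothetical.

```python
import numpy as np

def forward_noise(labels, alpha_bar_t, num_classes, rng):
    """Sample x_t ~ q(x_t | x_0) for a uniform-transition categorical kernel.

    Each voxel keeps its original class with probability alpha_bar_t and is
    otherwise resampled uniformly over all classes (assumed kernel, see lead-in).
    """
    keep = rng.random(labels.shape) < alpha_bar_t
    uniform = rng.integers(0, num_classes, size=labels.shape)
    return np.where(keep, labels, uniform)

rng = np.random.default_rng(0)
num_classes = 11                                       # hypothetical number of semantic classes
scene = rng.integers(0, num_classes, size=(8, 8, 4))   # toy voxel grid of class labels

noisy_early = forward_noise(scene, 0.95, num_classes, rng)  # early step: mostly intact
noisy_late = forward_noise(scene, 0.05, num_classes, rng)   # late step: close to uniform noise

print("label agreement, early step:", (noisy_early == scene).mean())
print("label agreement, late step:", (noisy_late == scene).mean())
```

A denoising network would be trained to invert this corruption, predicting the clean labels from `x_t`. For scene completion, the same network would additionally receive the partial observation as a condition.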
