CLIP-Layout: Style-Consistent Indoor Scene Synthesis with Semantic Furniture Embedding

Indoor scene synthesis involves automatically selecting furniture and placing it on a floor plan so that the resulting scene looks realistic and is functionally plausible. Such scenes can serve as homes for immersive 3D experiences or be used to train embodied agents. Existing methods for this task rely on labeled furniture categories, e.g., bed, chair, or table, to generate contextually relevant combinations of furniture. Whether heuristic or learned, these methods ignore instance-level attributes of objects such as color and style, and as a result may produce less visually coherent scenes. In this paper, we introduce an auto-regressive scene model that outputs instance-level predictions by making use of general-purpose image embeddings based on CLIP. This allows us to learn visual correspondences such as matching color and style, and to produce more plausible and aesthetically pleasing scenes. Evaluated on the 3D-FRONT dataset, our model achieves state-of-the-art results in scene generation and improves auto-completion metrics by over 50%. Moreover, our embedding-based approach enables zero-shot text-guided scene generation and editing, and it readily generalizes to furniture not seen at training time.
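To illustrate how an embedding-based prediction can select a specific furniture instance rather than only a category, the following is a minimal sketch that assumes the predicted embedding is matched to a catalog by cosine similarity. The abstract does not specify the retrieval rule, so the function names (cosine_similarity, retrieve_furniture), the catalog entries, and the 3-dimensional toy vectors are all hypothetical; real CLIP embeddings have hundreds of dimensions. Because CLIP maps text and images into a shared space, the same lookup could in principle take a text embedding as the query, which is one way to read the zero-shot text-guided setting.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_furniture(query, catalog, top_k=1):
    """Return the top_k (name, score) pairs from catalog, ranked by similarity to query.

    catalog maps a furniture identifier to its embedding. New items can be
    added to the catalog without retraining, since only their embeddings are needed.
    """
    scored = [(name, cosine_similarity(query, emb)) for name, emb in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy catalog: stand-ins for CLIP image embeddings of furniture renders.
catalog = {
    "oak_chair": [1.0, 0.0, 0.0],
    "steel_lamp": [0.0, 1.0, 0.0],
    "oak_table": [0.7, 0.7, 0.0],
}

# Stand-in for an embedding predicted by the scene model (or a CLIP text embedding).
predicted = [0.9, 0.1, 0.0]

matches = retrieve_furniture(predicted, catalog, top_k=2)
names = [name for name, _ in matches]
```

In this toy example the query lies closest to the "oak_chair" vector, so that item is ranked first, followed by "oak_table".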