Language-driven synthesis of 3D scenes from scene databases

We introduce a novel framework for using natural language to generate and edit 3D indoor scenes, harnessing scene semantics and text-scene grounding knowledge learned from large annotated 3D scene databases. The advantage of natural language editing interfaces is strongest when performing semantic operations at the sub-scene level, acting on groups of objects. We learn how to manipulate these sub-scenes by analyzing existing 3D scenes. We perform an edit by first parsing the user's natural language command and transforming it into a semantic scene graph, which is then used to retrieve sub-scenes matching the command from the databases. We then augment the retrieved sub-scene by incorporating other objects that may be implied by the scene context. Finally, a new 3D scene is synthesized by aligning the augmented sub-scene with the user's current scene: new objects are spliced into the environment, possibly triggering appropriate adjustments to the existing scene arrangement. A suggestive modeling interface that presents multiple interpretations of a user command is used to alleviate ambiguities in natural language. We conduct studies comparing our approach against both prior text-to-scene work and artist-made scenes, and find that our method significantly outperforms prior work and is comparable to handmade scenes even when complex and varied natural sentences are used.
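
The parse, retrieve, and augment steps described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the parser, the graph representation, or the matching criterion, so everything here is an assumption made for illustration. In particular, `SceneGraph`, `Relation`, `parse_command`, `match_score`, `retrieve`, and `implied_objects` are hypothetical names, the parser handles only simple "put X on the Y" patterns, and the score is a plain overlap count. The final alignment of the sub-scene with the user's 3D scene is not covered.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A directed spatial relation, e.g. lamp --on--> desk (hypothetical type)."""
    subject: str
    predicate: str
    target: str


@dataclass
class SceneGraph:
    """Toy semantic scene graph: object labels plus pairwise relations."""
    objects: set = field(default_factory=set)
    relations: set = field(default_factory=set)


PREPOSITIONS = {"on", "under", "beside", "near"}
STOP_WORDS = {"a", "an", "the", "put", "place", "add"}


def parse_command(command: str) -> SceneGraph:
    """Toy parser (assumption): handles only '<verb> a X <prep> the Y'."""
    tokens = command.lower().rstrip(".").split()
    graph = SceneGraph()
    for i, tok in enumerate(tokens):
        if tok not in PREPOSITIONS:
            continue
        left = [t for t in tokens[:i] if t not in STOP_WORDS]
        right = [t for t in tokens[i + 1:] if t not in STOP_WORDS]
        if left and right:
            rel = Relation(left[-1], tok, right[0])
            graph.objects.update({rel.subject, rel.target})
            graph.relations.add(rel)
    return graph


def match_score(query: SceneGraph, candidate: SceneGraph) -> int:
    """Overlap score (assumption): shared objects, with relations weighted double."""
    shared_objects = len(query.objects & candidate.objects)
    shared_relations = len(query.relations & candidate.relations)
    return shared_objects + 2 * shared_relations


def retrieve(query: SceneGraph, database: dict, k: int = 1) -> list:
    """Return the names of the k database sub-scenes that best match the query."""
    ranked = sorted(database, key=lambda name: match_score(query, database[name]),
                    reverse=True)
    return ranked[:k]


def implied_objects(query: SceneGraph, retrieved: SceneGraph) -> set:
    """Objects present in the retrieved sub-scene but not named in the command."""
    return retrieved.objects - query.objects


# Tiny hand-made "database" of annotated sub-scenes (illustrative data only).
database = {
    "office_corner": SceneGraph(
        objects={"desk", "lamp", "chair"},
        relations={Relation("lamp", "on", "desk"), Relation("chair", "near", "desk")},
    ),
    "bedside": SceneGraph(
        objects={"bed", "nightstand", "lamp"},
        relations={Relation("lamp", "on", "nightstand")},
    ),
}

query = parse_command("Put a lamp on the desk.")
best = retrieve(query, database)[0]
extras = implied_objects(query, database[best])
print(best, sorted(extras))
```

In this toy database the command matches the office sub-scene, and the chair is the object that would be added as context. A real system would need a proper dependency parser, synonym handling, and geometric alignment, none of which is modeled here.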
