Learning Spatial Knowledge for Text to 3D Scene Generation

We address the problem of grounding natural language in concrete spatial constraints and inferring implicit pragmatics in 3D environments. We apply our approach to the task of text-to-3D scene generation. We present a representation for common-sense spatial knowledge and an approach to extract it from 3D scene data. In text-to-3D scene generation, a user provides natural language text as input, from which we extract explicit constraints on the objects that should appear in the scene. The main innovation of this work is to show how to augment these explicit constraints with learned spatial knowledge to infer missing objects and likely layouts for the objects in the scene. We demonstrate that spatial knowledge is useful for interpreting natural language, and we show examples of learned knowledge and generated 3D scenes.
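One way to picture the "infer missing objects" step described above is as learning co-occurrence priors from a corpus of 3D scenes and using them to suggest objects not explicitly mentioned in the input text. The sketch below is a hypothetical simplification, not the paper's actual model: the function names, the toy scene data, and the simple conditional-probability estimate are all illustrative assumptions.

```python
# Hypothetical sketch: inferring implicit objects from co-occurrence priors
# learned over a corpus of 3D scenes. All names and data here are illustrative.
from collections import defaultdict

def learn_cooccurrence(scenes):
    """Estimate P(b present | a present) for object-category pairs,
    where each scene is represented as a set of category labels."""
    pair_counts = defaultdict(int)
    cat_counts = defaultdict(int)
    for scene in scenes:
        for a in scene:
            cat_counts[a] += 1
            for b in scene:
                if a != b:
                    pair_counts[(a, b)] += 1
    cooc = {(a, b): c / cat_counts[a] for (a, b), c in pair_counts.items()}
    return cooc

def infer_missing(mentioned, cooc, threshold=0.5):
    """Suggest categories not mentioned in the text but likely present,
    based on their co-occurrence with the explicitly mentioned objects."""
    suggestions = set()
    for a in mentioned:
        for (x, b), p in cooc.items():
            if x == a and b not in mentioned and p >= threshold:
                suggestions.add(b)
    return suggestions

# Toy scene corpus (hypothetical data): "a desk" in the input text would
# prompt suggestions of a chair and a monitor, which frequently co-occur.
scenes = [{"desk", "chair", "monitor"}, {"desk", "chair"}, {"desk", "monitor"}]
cooc = learn_cooccurrence(scenes)
print(infer_missing({"desk"}, cooc))
```

In the full system, such priors would be combined with explicit constraints extracted from the text and with learned layout preferences (relative positions and supports) to place the objects; the sketch only covers the object-inference piece.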
