Static and Animated 3D Scene Generation from Free-form Text Descriptions

Generating coherent and useful image or video scenes from a free-form textual description is a technically difficult problem. Descriptions of the same scene can vary greatly from person to person, and even for the same person over time. Because word choice and syntax vary across descriptions, it is challenging for a system to reliably produce a consistent, desirable output from different forms of language input. Prior work on scene generation has mostly been confined to rigid sentence structures, which restrict the user's freedom in writing a description. In this work, we study a new pipeline that generates both static and animated 3D scenes from free-form textual scene descriptions without major restrictions. To keep the study practical and tractable, we focus on a small subspace of all possible 3D scenes, containing various combinations of cubes, cylinders, and spheres. We design a two-stage pipeline: the first stage encodes the free-form text with an encoder-decoder neural architecture, and the second stage generates a 3D scene from the resulting encoding. Our architecture uses a state-of-the-art language model as the encoder, leveraging its rich contextual encoding, together with a new multi-head decoder that predicts multiple features of each object in the scene simultaneously. For our experiments, we generate a large synthetic dataset containing 1,300,000 unique static and 1,400,000 unique animated scene descriptions. We achieve 98.427% accuracy on the test set in detecting 3D object features. Our work is a proof of concept for one approach to this problem, and we believe that with enough training data the same pipeline can be extended to a broader class of 3D scene generation problems.
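To make the multi-head decoding idea concrete, the following is a minimal PyTorch sketch, not the authors' architecture: a small Transformer encoder stands in for the pretrained language-model encoder, and separate linear heads predict per-attribute logits from one shared text encoding. All class names, dimensions, and attribute vocabularies (3 shapes, 8 colors, 3 sizes, 9 coarse positions) are illustrative assumptions.

import torch
import torch.nn as nn

class MultiHeadSceneDecoder(nn.Module):
    # Stage-1 sketch: one shared text encoding feeds several attribute heads.
    def __init__(self, d_model=128, vocab_size=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # One classification head per object attribute; class counts are assumptions.
        self.heads = nn.ModuleDict({
            "shape": nn.Linear(d_model, 3),      # cube / cylinder / sphere
            "color": nn.Linear(d_model, 8),
            "size": nn.Linear(d_model, 3),
            "position": nn.Linear(d_model, 9),   # coarse grid location
        })

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))  # (batch, seq_len, d_model)
        pooled = h.mean(dim=1)                   # crude sentence-level encoding
        # All attribute heads are evaluated simultaneously from one encoding.
        return {name: head(pooled) for name, head in self.heads.items()}

# Usage: a batch of 2 tokenized descriptions, each padded to 16 token ids.
model = MultiHeadSceneDecoder()
tokens = torch.randint(0, 1000, (2, 16))
logits = model(tokens)
print({name: tuple(t.shape) for name, t in logits.items()})

In a second stage, the argmax of each head's logits would be mapped to a concrete scene primitive (for example, a small red cube at a given grid cell) and handed to a renderer; that deterministic mapping is omitted from the sketch.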
