Text2SceneVR: Generating Hypertexts with VAnnotatoR as a Pre-processing Step for Text2Scene Systems

The automatic generation of digital scenes from texts is a central task of computer science. This task requires a form of text comprehension whose automation depends on the availability of sufficiently large, diverse, deeply annotated and freely available data. This paper introduces Text2SceneVR, a system that addresses this bottleneck by allowing its users to create spatial hypertexts in Virtual Reality (VR). We describe Text2SceneVR's data model and user interface, along with a number of problems that arise because natural language often expresses spatial relations only implicitly; Text2SceneVR aims to address these problems while remaining language independent. Finally, we present a user study in which we evaluated Text2SceneVR.
