Evaluating the WordsEye Text-to-Scene System: Imaginative and Realistic Sentences

We describe our evaluation of the WordsEye text-to-scene generation system, addressing the problem of evaluating the output of such a system against simple search methods for finding a picture to illustrate a sentence. To do this, we constructed two sets of test sentences: a set of crowdsourced imaginative sentences and a set of realistic sentences extracted from the PASCAL image caption corpus (Rashtchian et al., 2010). For each sentence, we compared sample pictures found using Google Image Search to those produced by WordsEye. We then crowdsourced judgments as to which picture best illustrated each sentence: for imaginative sentences, the WordsEye pictures were preferred, while for realistic sentences, the Google Image Search results were preferred. We also used crowdsourcing to obtain a rating of how well each picture illustrated its sentence, from 1 (completely correct) to 5 (completely incorrect). WordsEye pictures received an average rating of 2.58 on imaginative sentences and 2.54 on realistic sentences; Google images received an average rating of 3.82 on imaginative sentences and 1.87 on realistic sentences. We also discuss the sources of errors in the WordsEye system.

[1]  Minhua Ma, et al. Automatic Conversion of Natural Language to 3D Animation, 2006.

[2]  Lijun Yin, et al. Real-time automatic 3D scene generation from natural language voice and text descriptions, 2006, MM '06.

[3]  Christopher Potts, et al. Text to 3D Scene Generation with Rich Lexical Grounding, 2015, ACL.

[4]  Jianguo Zhang, et al. The PASCAL Visual Object Classes Challenge, 2006.

[5]  Sneha N. Dessai. Text to 3D Scene Generation, 2016.

[6]  Rafael Radkowski, et al. Ontology-driven Generation of 3D Animations for Training and Maintenance, 2007, International Conference on Multimedia and Ubiquitous Engineering (MUE'07).

[7]  Richard Sproat, et al. WordsEye: an automatic text-to-scene conversion system, 2001, SIGGRAPH.

[8]  Xinlei Chen, et al. Microsoft COCO Captions: Data Collection and Evaluation Server, 2015, ArXiv.

[9]  Won-Sook Lee, et al. Visualizing Natural Language Descriptions, 2016, ACM Comput. Surv.

[10]  Cyrus Rashtchian, et al. Collecting Image Annotations Using Amazon's Mechanical Turk, 2010, Mturk@HLT-NAACL.

[11]  C. Lawrence Zitnick, et al. Bringing Semantics into Focus Using Visual Abstraction, 2013, IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Shaun Bangay, et al. Automating the creation of 3D animation from annotated fiction text, 2008.

[13]  Lucy Vanderwende, et al. Learning the Visual Interpretation of Sentences, 2013, IEEE International Conference on Computer Vision.

[14]  Angel X. Chang, et al. Semantic Parsing for Text to 3D Scene Generation, 2014, ACL.

[15]  Quasim H. Mehdi, et al. From visual semantic parameterization to graphic visualization, 2005, Ninth International Conference on Information Visualisation (IV'05).

[16]  Christopher K. I. Williams, et al. PASCAL Visual Object Classes Challenge Results, 2005.