Analysing Data-To-Text Generation Benchmarks

A generation system can only be as good as the data it is trained on. In this short paper, we propose a methodology for analysing data-to-text corpora used for training micro-planners, i.e., systems which, given some input, must produce a text verbalising exactly this input. We apply this methodology to three existing benchmarks and derive a set of criteria for the creation of a data-to-text benchmark that could better support the development, evaluation and comparison of linguistically sophisticated data-to-text generators.
