Evaluation methodologies in Automatic Question Generation 2013-2018

In recent years, Automatic Question Generation (AQG) has attracted increasing interest. In this paper we survey the evaluation methodologies used in AQG between 2013 and 2018. Based on a sample of 37 papers, we find that progress in system development has not been matched by corresponding progress in evaluation methodology: the papers we examine employ a wide variety of both intrinsic and extrinsic evaluation methods. Such diverse evaluation practices make it difficult to compare the quality of different generation systems reliably. Given the rapidly growing level of research in the area, our study suggests that a common evaluation framework is urgently needed, both for AQG systems and for NLG systems more generally.
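
As an illustration of the kind of intrinsic evaluation many of the surveyed papers rely on, the sketch below scores a generated question against human-written reference questions with sentence-level BLEU. The example questions and the choice of metric are our own assumptions for illustration; the survey does not endorse BLEU as the right metric for AQG.

```python
# Minimal sketch of an intrinsic automatic evaluation for AQG:
# compare one generated question against several reference questions
# using sentence-level BLEU (via NLTK). Data here is invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "who wrote the origin of species ?".split(),
    "which scientist authored the origin of species ?".split(),
]
candidate = "who is the author of the origin of species ?".split()

# Smoothing avoids zero scores when higher-order n-grams fail to match,
# which is common for short sentences such as questions.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```

Note that such n-gram overlap scores reward surface similarity to the references, which is one reason diverse evaluation practices (automatic metrics, human ratings, extrinsic task success) are hard to reconcile across systems.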
