Evaluation methodologies in Automatic Question Generation 2013-2018
[1] Hannes Schulz, et al. Relevance of Unsupervised Metrics in Task-Oriented Dialogue for Evaluating Natural Language Generation, 2017, ArXiv.
[2] Jianfeng Gao, et al. deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets, 2015, ACL.
[3] Verena Rieser, et al. RankME: Reliable Human Ratings for Natural Language Generation, 2018, NAACL.
[4] Tomoko Kojiri, et al. Automatic Question Generation for Educational Applications - The State of Art, 2014, ICCSAMA.
[5] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, IEEvaluation@ACL.
[6] Anja Belz, et al. An Investigation into the Validity of Some Metrics for Automatically Evaluating Natural Language Generation Systems, 2009, CL.
[7] Paul Piwek, et al. Collecting Reliable Human Judgements on Machine-Generated Language: The Case of the QG-STEC Data, 2016, INLG.
[8] Klaus Krippendorff, et al. Content Analysis: An Introduction to Its Methodology, 1980.
[9] Paul Piwek, et al. The First Question Generation Shared Task Evaluation Challenge, 2010, Dialogue Discourse.
[10] Sheetal Rakangor, et al. Literature Review of Automatic Question Generation Systems, 2015.
[11] Jean Carletta, et al. Assessing Agreement on Classification Tasks: The Kappa Statistic, 1996, CL.
[12] J. Eriksson. Lessons from a failure: Generating tailored smoking cessation letters, 2003.
[13] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[14] Chin-Yew Lin, et al. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics, 2004, ACL.
[15] Kristy Elizabeth Boyer, et al. Varieties of Question Generation: Introduction to this Special Issue, 2012, Dialogue Discourse.
[16] Ehud Reiter, et al. A Structured Review of the Validity of BLEU, 2018, CL.
[17] Chung Yong Lim, et al. A Case Study on Inter-Annotator Agreement for Word Sense Disambiguation, 1999.
[18] Ron Artstein, et al. Survey Article: Inter-Coder Agreement for Computational Linguistics, 2008, CL.
[19] Albert Gatt, et al. Introducing Shared Tasks to NLG: The TUNA Shared Task Evaluation Challenges, 2010, Empirical Methods in Natural Language Generation.
[20] Vasile Rus, et al. A Comparison of Greedy and Optimal Assessment of Natural Language Student Input Using Word-to-Word Similarity Metrics, 2012, BEA@NAACL-HLT.
[21] Philip Bachman, et al. Machine Comprehension by Text-to-Text Neural Question Generation, 2017, Rep4NLP@ACL.
[22] Aljoscha Burchardt, et al. Assessing Inter-Annotator Agreement for Translation Error Annotation, 2014.
[23] Dimitra Gkatzia, et al. A Snapshot of NLG Evaluation Practices 2005-2014, 2015, ENLG.
[24] Petra Saskia Bayerl, et al. What Determines Inter-Coder Agreement in Manual Annotations? A Meta-Analytic Investigation, 2011, CL.
[25] Emiel Krahmer, et al. Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation, 2017, J. Artif. Intell. Res.
[26] Paul Piwek, et al. Rethinking the Agreement in Human Evaluation Tasks, 2018, COLING.
[27] Klaus Krippendorff, et al. Computing Krippendorff's Alpha-Reliability, 2011.