Dynamic Human Evaluation for Relative Model Comparisons
[1] Tal August, et al. All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text, 2021, ACL.
[2] Diyi Yang, et al. The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics, 2021, GEM.
[3] Dimitra Gkatzia, et al. Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions, 2020, INLG.
[4] Hadas Kotek, et al. Improving Human-Labeled Data through Dynamic Automatic Conflict Resolution, 2020, COLING.
[5] Lyle Ungar, et al. Item Response Theory for Efficient Human Evaluation of Chatbots, 2020, EVAL4NLP.
[6] Peter Henderson, et al. With Little Power Comes Great Responsibility, 2020, EMNLP.
[7] Elizabeth Clark, et al. Evaluation of Text Generation: A Survey, 2020, arXiv.
[8] Ce Zhang, et al. Control, Generate, Augment: A Scalable Framework for Multi-Attribute Text Generation, 2020, Findings of EMNLP.
[9] Albert Gatt, et al. Best practices for the human evaluation of automatically generated text, 2019, INLG.
[10] Samira Shaikh, et al. Towards Best Experiment Design for Evaluating Dialogue System Output, 2019, INLG.
[11] Ali Farhadi, et al. Defending Against Neural Fake News, 2019, NeurIPS.
[12] Percy Liang, et al. Unifying Human and Statistical Evaluation for Natural Language Generation, 2019, NAACL.
[13] Michael S. Bernstein, et al. HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models, 2019, NeurIPS.
[14] Arun Tejasvi Chaganty, et al. The price of debiasing automatic metrics in natural language evaluation, 2018, ACL.
[15] Verena Rieser, et al. RankME: Reliable Human Ratings for Natural Language Generation, 2018, NAACL.
[16] Verena Rieser, et al. Why We Need New Evaluation Metrics for NLG, 2017, EMNLP.
[17] Samy Bengio, et al. Generating Sentences from a Continuous Space, 2015, CoNLL.
[18] Adam J. Berinsky, et al. Evaluating Online Labor Markets for Experimental Research: Amazon.com's Mechanical Turk, 2012, Political Analysis.
[19] Javier R. Movellan, et al. Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise, 2009, NIPS.
[20] Philipp Koehn, et al. (Meta-) Evaluation of Machine Translation, 2007, WMT@ACL.
[21] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL.
[22] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[23] Jungo Kasai, et al. GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation, 2021, arXiv.
[24] Jeroen B. P. Vuurens, et al. How Much Spam Can You Take? An Analysis of Crowdsourcing Results to Increase Accuracy, 2011.
[25] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables, 1963, Journal of the American Statistical Association.