Dynamic Human Evaluation for Relative Model Comparisons

Collecting human judgements is currently the most reliable evaluation method for natural language generation systems. Flaws have been reported in automatic metrics when they are applied to measure quality aspects of generated text, and such metrics have been shown to correlate poorly with human judgements. However, human evaluation is time- and cost-intensive, and there is no consensus on how to design and conduct human evaluation experiments. There is therefore a need for streamlined approaches that collect human judgements efficiently when evaluating natural language generation systems. We present a dynamic approach for determining the number of human annotations required when evaluating generated outputs in a relative comparison setting. We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods for deciding which model is better, in both a simulation and a crowdsourcing case study. The main results indicate that a decision about the superior model can be made with high probability across the different labelling strategies, and that assigning a single random worker per task requires the least overall labelling effort and therefore the least cost.
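To make the idea of a dynamic stopping rule for relative comparisons concrete, the sketch below simulates noisy annotators voting on pairwise output comparisons and stops collecting votes once a confidence bound around the observed win rate excludes a tie. This is a minimal illustration, not the paper's actual algorithm: the Hoeffding-style bound, the worker-noise model, and the parameter names (p_true, worker_accuracy, delta) are assumptions introduced here for exposition.

```python
"""Illustrative sketch of a dynamic stopping rule for deciding which of two
NLG models wins a pairwise human evaluation. Not the paper's exact method;
the bound, noise model, and parameters are assumptions for illustration."""
import math
import random


def simulate_vote(p_true: float, worker_accuracy: float) -> int:
    """One simulated annotator compares an output pair and votes for model A (1) or B (0).

    p_true          -- latent probability that model A's output is genuinely better
    worker_accuracy -- probability the simulated worker reports the true preference
    """
    truly_a_better = random.random() < p_true
    reports_truth = random.random() < worker_accuracy
    return int(truly_a_better == reports_truth)


def dynamic_comparison(p_true=0.6, worker_accuracy=0.8, delta=0.05, max_votes=5000):
    """Collect votes one at a time and stop once a Hoeffding confidence
    interval around the observed win rate excludes 0.5 (a tie)."""
    wins_a = 0
    for n in range(1, max_votes + 1):
        wins_a += simulate_vote(p_true, worker_accuracy)
        p_hat = wins_a / n
        # Hoeffding bound: P(|p_hat - p| >= eps) <= 2 * exp(-2 * n * eps^2) = delta
        eps = math.sqrt(math.log(2 / delta) / (2 * n))
        if p_hat - eps > 0.5:
            return "A", n, p_hat
        if p_hat + eps < 0.5:
            return "B", n, p_hat
    return "undecided", max_votes, wins_a / max_votes


if __name__ == "__main__":
    random.seed(0)
    winner, n_votes, win_rate = dynamic_comparison()
    print(f"winner={winner} after {n_votes} votes (observed win rate {win_rate:.3f})")
```

Running the simulation with different numbers of workers per task (here effectively one random worker per comparison) gives a sense of how many judgements a given labelling strategy needs before the superior model can be declared with high probability.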
