ReproGen: Proposal for a Shared Task on Reproducibility of Human Evaluations in NLG

Across NLP, a growing body of work addresses the issue of reproducibility. However, the replicability of human evaluation experiments and the reproducibility of their results remain under-addressed, which is of particular concern for NLG, where human evaluations are the norm. This paper outlines our ideas for a shared task on reproducibility of human evaluations in NLG which aims (i) to shed light on the extent to which past NLG evaluations have been replicable and reproducible, and (ii) to draw conclusions regarding how evaluations can be designed and reported to increase replicability and reproducibility. If the task is run over several years, we hope to be able to document an overall increase in levels of replicability and reproducibility over time.
