Understanding the Impact of Experiment Design for Evaluating Dialogue System Output

Evaluation of output from natural language generation (NLG) systems is typically conducted via crowdsourced human judgments. To understand how experiment design might affect the quality and consistency of such human judgments, we designed a between-subjects study with four experimental conditions. Through our systematic study with 40 crowdsourced workers in each task, we find that using continuous scales achieves more consistent ratings than Likert scale or ranking-based experiment designs. Additionally, we find that factors such as no prior experience of participating in similar studies of rating dialogue system output also influence the consistency and agreement of ratings among workers.
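
The abstract's central comparison is the consistency of ratings across workers under each experimental condition. As a concrete illustration only (not the authors' analysis code), the sketch below computes the intraclass correlation coefficient ICC(2,1) following Shrout and Fleiss (1979), a standard measure of inter-rater consistency for continuous ratings. The items-by-raters matrix and the 0-100 slider scores are hypothetical.

```python
# Minimal sketch, assuming ratings are collected as an items x raters matrix
# of continuous scores (e.g., from a 0-100 slider condition).
# Computes ICC(2,1): two-way random effects, absolute agreement, single rater,
# following Shrout & Fleiss (1979).
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """Intraclass correlation ICC(2,1) for an (n items, k raters) matrix."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-item means
    col_means = ratings.mean(axis=0)   # per-rater means

    ss_total = ((ratings - grand_mean) ** 2).sum()
    ss_rows = k * ((row_means - grand_mean) ** 2).sum()   # between items
    ss_cols = n * ((col_means - grand_mean) ** 2).sum()   # between raters
    ss_err = ss_total - ss_rows - ss_cols                 # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical example: 5 dialogue system responses rated by 4 crowdworkers.
ratings = np.array([
    [81, 78, 85, 80],
    [42, 45, 40, 50],
    [65, 70, 68, 66],
    [90, 88, 92, 85],
    [30, 35, 28, 33],
])
print(f"ICC(2,1) = {icc_2_1(ratings):.3f}")
```

Higher ICC values indicate that raters order and score the items more similarly, which is the sense in which one experiment design can be said to yield "more consistent ratings" than another.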
