Item Response Theory for Efficient Human Evaluation of Chatbots

Conversational agent quality is currently assessed using human evaluation, which often requires an exorbitant number of comparisons to achieve statistical significance. In this paper, we introduce Item Response Theory (IRT) for chatbot evaluation, using a paired-comparison setup in which annotators judge which system responds better to the next turn of a conversation. IRT is widely used in educational testing to simultaneously assess the ability of test takers and the quality of test questions. It is similarly well suited to chatbot evaluation, since it allows the assessment of both the models and the prompts used to evaluate them. We use IRT to efficiently assess chatbots, and show that different examples from the evaluation set are better suited for comparing high-quality systems (those nearer to human performance) than for comparing low-quality ones. Finally, we use IRT to reduce the number of evaluation examples assessed by human annotators while retaining discriminative power.
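As a rough illustration of the kind of model involved, the sketch below fits a Bradley-Terry-style paired-comparison model with a per-example discrimination parameter, so that both system ability and example informativeness are estimated jointly. The parameterization, parameter names (theta, a), and the synthetic data are assumptions made for illustration, not the paper's exact specification.

```python
# Minimal sketch, assuming P(system i beats system j on example k)
# = sigmoid(a_k * (theta_i - theta_j)), where theta is latent system
# ability and a_k is the example's discrimination. Illustrative only.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)
n_systems, n_examples = 4, 50

# Synthetic paired-comparison outcomes as (winner, loser, example) triples.
true_theta = rng.normal(size=n_systems)                 # latent system ability
true_disc = np.abs(rng.normal(1.0, 0.3, n_examples))    # example discrimination
comparisons = []
for _ in range(2000):
    i, j = rng.choice(n_systems, size=2, replace=False)
    k = rng.integers(n_examples)
    p_i_wins = expit(true_disc[k] * (true_theta[i] - true_theta[j]))
    comparisons.append((i, j, k) if rng.random() < p_i_wins else (j, i, k))
comparisons = np.array(comparisons)

def neg_log_likelihood(params):
    theta = params[:n_systems]
    a = np.exp(params[n_systems:])  # keep discriminations positive
    w, l, k = comparisons[:, 0], comparisons[:, 1], comparisons[:, 2]
    logits = a[k] * (theta[w] - theta[l])
    # -log sigmoid(logits), written in a numerically stable form
    return np.sum(np.logaddexp(0.0, -logits))

x0 = np.zeros(n_systems + n_examples)
fit = minimize(neg_log_likelihood, x0, method="L-BFGS-B")
theta_hat = fit.x[:n_systems] - fit.x[:n_systems].mean()  # anchor the scale
print("estimated system abilities:", np.round(theta_hat, 2))
```

Under this kind of model, examples with larger estimated discrimination separate systems of similar ability more sharply, which is the property that lets an evaluation set be pruned while retaining discriminative power.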
