Item Response Theory for Efficient Human Evaluation of Chatbots
[1] Thomas Hofmann,et al. TrueSkill™: A Bayesian Skill Rating System , 2007 .
[2] Alexander M. Rush,et al. OpenNMT: Open-Source Toolkit for Neural Machine Translation , 2017, ACL.
[3] L. Thurstone. A law of comparative judgment. , 1994 .
[4] Mark Hopkins,et al. Models of Translation Competitions , 2013, ACL.
[5] Thorsten Dickhaus,et al. Simultaneous Statistical Inference , 2014, Springer Berlin Heidelberg.
[6] J. Fleiss. Measuring nominal scale agreement among many raters. , 1971 .
[7] M. de Rijke,et al. Dialogue Generation: From Imitation Learning to Inverse Reinforcement Learning , 2018, AAAI.
[8] Manuela Cattelan,et al. Models for Paired Comparison Data: A Review with Emphasis on Dependent Data , 2012, arXiv:1210.1016.
[9] Philipp Koehn,et al. Findings of the 2011 Workshop on Statistical Machine Translation , 2011, WMT@EMNLP.
[10] Alexander I. Rudnicky,et al. A Dataset of Topic-Oriented Human-to-Chatbot Dialogues , 2018 .
[11] Verena Rieser,et al. Why We Need New Evaluation Metrics for NLG , 2017, EMNLP.
[12] Quoc V. Le,et al. A Neural Conversational Model , 2015, ArXiv.
[13] Jason Weston,et al. ParlAI: A Dialog Research Software Platform , 2017, EMNLP.
[14] R. A. Bradley,et al. RANK ANALYSIS OF INCOMPLETE BLOCK DESIGNS: I. THE METHOD OF PAIRED COMPARISONS , 1952 .
[15] F. Samejima. Estimation of latent ability using a response pattern of graded scores , 1968 .
[16] Dongyan Zhao,et al. RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems , 2017, AAAI.
[17] R. Hambleton,et al. Item Response Theory , 1984, The History of Educational Measurement.
[18] Mark Dras,et al. Squibs: Evaluating Human Pairwise Preference Judgments , 2015, CL.
[19] Jason Weston,et al. ACUTE-EVAL: Improved Dialogue Evaluation with Optimized Questions and Multi-turn Comparisons , 2019, ArXiv.
[20] Jianfeng Gao,et al. A Diversity-Promoting Objective Function for Neural Conversation Models , 2015, NAACL.
[21] Daniel Jurafsky,et al. Data Distillation for Controlling Specificity in Dialogue Generation , 2017, ArXiv.
[22] Alberto Maydeu-Olivares,et al. Item Response Modeling of Paired Comparison and Ranking Data , 2010, Multivariate behavioral research.
[23] Chris Callison-Burch,et al. ChatEval: A Tool for Chatbot Evaluation , 2019, NAACL.
[24] Jianfeng Gao,et al. A Neural Network Approach to Context-Sensitive Generation of Conversational Responses , 2015, NAACL.
[25] Cristian Danescu-Niculescu-Mizil,et al. Chameleons in Imagined Conversations: A New Approach to Understanding Coordination of Linguistic Style in Dialogs , 2011, CMCL@ACL.
[26] Verena Rieser,et al. RankME: Reliable Human Ratings for Natural Language Generation , 2018, NAACL.
[27] Joelle Pineau,et al. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation , 2016, EMNLP.
[28] Philipp Koehn. Simulating human judgment in machine translation evaluation campaigns , 2012, IWSLT.
[29] Quoc V. Le,et al. Sequence to Sequence Learning with Neural Networks , 2014, NIPS.
[30] Jörg Tiedemann,et al. Parallel Data, Tools and Interfaces in OPUS , 2012, LREC.
[31] Matt Post,et al. Efficient Elicitation of Annotations for Human Evaluation of Machine Translation , 2014, WMT@ACL.
[32] D. Andrich. A rating formulation for ordered response categories , 1978 .
[33] Biao Wu,et al. Automated Scoring of Chatbot Responses in Conversational Dialogue , 2018, IWSDS.
[34] Jianfeng Gao,et al. deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets , 2015, ACL.
[35] Mary Williamson,et al. Recipes for Building an Open-Domain Chatbot , 2020, EACL.
[36] Joelle Pineau,et al. Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses , 2017, ACL.
[37] Quoc V. Le,et al. Towards a Human-like Open-Domain Chatbot , 2020, ArXiv.
[38] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[39] Hao Wu,et al. Learning Latent Parameters without Human Response Patterns: Item Response Theory with Artificial Crowds , 2019, EMNLP.
[40] J. Guilford. The Correlation of an Item With a Composite of the Remaining Items in a Test , 1953 .
[41] Paul Piwek,et al. Agreement is overrated: A plea for correlation to assess human evaluation reliability , 2019, INLG.
[42] Daisuke Kawahara,et al. IRT-based Aggregation Model of Crowdsourced Pairwise Comparison for Evaluating Machine Translations , 2016, EMNLP.
[43] Hang Li,et al. Neural Responding Machine for Short-Text Conversation , 2015, ACL.
[44] Kyunghyun Cho,et al. Importance of Search and Evaluation Strategies in Neural Dialogue Modeling , 2018, INLG.
[45] R. A. Bradley,et al. RANK ANALYSIS OF INCOMPLETE BLOCK DESIGNS , 1952 .
[46] Pascal Poupart,et al. Deep Active Learning for Dialogue Generation , 2016, *SEMEVAL.
[47] Hannah R Rothstein,et al. A basic introduction to fixed‐effect and random‐effects models for meta‐analysis , 2010, Research synthesis methods.
[48] Philipp Koehn,et al. (Meta-) Evaluation of Machine Translation , 2007, WMT@ACL.
[49] Rotem Dror,et al. The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing , 2018, ACL.
[50] Jason Weston,et al. Importance of a Search Strategy in Neural Dialogue Modelling , 2018, ArXiv.
[51] W. Bruce Croft,et al. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , 2013 .
[52] M. R. Novick,et al. Statistical Theories of Mental Test Scores. , 1971 .
[53] Joelle Pineau,et al. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues , 2016, AAAI.
[54] Ming-Wei Chang,et al. A Knowledge-Grounded Neural Conversation Model , 2017, AAAI.
[55] Hao Wu,et al. Building an Evaluation Scale using Item Response Theory , 2016, EMNLP.
[56] Alan Ritter,et al. Unsupervised Modeling of Twitter Conversations , 2010, NAACL.
[57] Tom Minka,et al. TrueSkill™: A Bayesian Skill Rating System , 2006, NIPS.
[58] Christine E. DeMars,et al. Item Response Theory , 2010, Assessing Measurement Invariance for Applied Research.
[59] Alan Ritter,et al. Generating More Interesting Responses in Neural Conversation Models with Distributional Constraints , 2018, EMNLP.
[60] Jianfeng Gao,et al. DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation , 2020, ACL.