A Human-machine Collaborative Framework for Evaluating Malevolence in Dialogues

Conversational dialogue systems (CDSs) are hard to evaluate due to the complexity of natural language. Automatic evaluation of dialogues often correlates insufficiently with human judgements. Human evaluation is reliable but labor-intensive. We introduce a human-machine collaborative framework, HMCEval, that guarantees reliable evaluation outcomes with reduced human effort. HMCEval casts dialogue evaluation as a sample assignment problem, in which we must decide whether to assign each sample to a human or a machine for evaluation. HMCEval includes a model confidence estimation module that estimates the confidence of the predicted sample assignment, a human effort estimation module that estimates the human effort required should the sample be assigned to human evaluation, and a sample assignment execution module that finds the optimal assignment based on the estimated confidence and effort. We assess the performance of HMCEval on the task of evaluating malevolence in dialogues. The experimental results show that HMCEval achieves around 99% evaluation accuracy while sparing half of the human effort, demonstrating that HMCEval provides reliable evaluation outcomes while substantially reducing human effort.
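To make the assignment step concrete, below is a minimal sketch of how a sample assignment execution module might trade off estimated machine confidence against estimated human effort. The Sample fields, the knapsack-style formulation, and the greedy heuristic are illustrative assumptions rather than the paper's actual method, which may instead solve the assignment exactly (e.g., with an integer-programming solver).

    # Illustrative sketch only: names, the 0/1-knapsack framing, and the
    # greedy solver are assumptions for exposition, not HMCEval's actual code.
    from dataclasses import dataclass

    @dataclass
    class Sample:
        sample_id: int
        confidence: float  # estimated machine confidence in [0, 1]
        effort: float      # estimated human effort (e.g., seconds to annotate)

    def assign_samples(samples: list[Sample], effort_budget: float) -> dict[int, str]:
        """Assign each sample to 'human' or 'machine' evaluation.

        Treats assignment as a 0/1 knapsack: sending a sample to a human
        costs `effort` and gains roughly (1 - confidence) in expected
        accuracy, assuming human labels are near-perfect. A greedy ratio
        heuristic stands in for an exact optimizer.
        """
        # Prioritize samples where human review buys the most accuracy per unit effort.
        ranked = sorted(
            samples,
            key=lambda s: (1 - s.confidence) / max(s.effort, 1e-9),
            reverse=True,
        )
        assignment, spent = {}, 0.0
        for s in ranked:
            if spent + s.effort <= effort_budget:
                assignment[s.sample_id] = "human"
                spent += s.effort
            else:
                assignment[s.sample_id] = "machine"
        return assignment

    if __name__ == "__main__":
        samples = [
            Sample(0, confidence=0.98, effort=5.0),   # machine is confident: keep automatic
            Sample(1, confidence=0.55, effort=4.0),   # uncertain and cheap: worth human review
            Sample(2, confidence=0.60, effort=20.0),  # uncertain but costly to review
        ]
        print(assign_samples(samples, effort_budget=6.0))
        # {1: 'human', 2: 'machine', 0: 'machine'}

Under this framing, raising the effort budget routes more low-confidence samples to humans, which is consistent with the reported trade-off of near-99% accuracy at roughly half the human effort.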
