A Human-machine Collaborative Framework for Evaluating Malevolence in Dialogues