Evaluation and Improvement of Chatbot Text Classification Data Quality Using Plausible Negative Examples

We describe and validate a metric for estimating multi-class classifier performance, based on cross-validation and adapted for improving the small, unbalanced natural-language datasets used in chatbot design. Our experiences draw upon building recruitment chatbots that mediate communication between job-seekers and recruiters by exposing the ML/NLP dataset to the recruiting team. Evaluation approaches must be understandable to various stakeholders and useful for improving chatbot performance. The metric, nex-cv, uses negative examples in the evaluation of text classification and fulfils three requirements. First, it is actionable: it can be used by non-developer staff. Second, it is not overly optimistic compared to human ratings, making it a fast method for comparing classifiers. Third, it allows model-agnostic comparison, making it useful for comparing systems despite implementation differences. We validate the metric on seven recruitment-domain datasets in English and German collected over the course of one year.
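
To make the core idea concrete, the sketch below shows a simplified cross-validation loop in the spirit of nex-cv, built on scikit-learn. It is an illustration under stated assumptions, not the paper's implementation: the function name nex_cv_score, the choice of classifier (TF-IDF features with logistic regression), the confidence-threshold rule for rejecting out-of-scope inputs, and the idea that the caller passes in which categories serve as plausible negatives are all assumptions here; the actual metric selects and weights negative-example categories according to its own parameterized strategy.

    # A minimal sketch of a nex-cv-style evaluation loop (assumptions noted above).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold
    from sklearn.pipeline import make_pipeline

    def nex_cv_score(texts, labels, negative_labels, n_splits=5, threshold=0.5):
        """Cross-validate while treating `negative_labels` as out-of-scope:
        their examples are excluded from training, and at test time they count
        as correct only if the classifier's confidence stays below `threshold`."""
        texts = np.asarray(texts)
        labels = np.asarray(labels)
        is_negative = np.isin(labels, list(negative_labels))
        fold_scores = []
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train_idx, test_idx in skf.split(texts, labels):
            keep = ~is_negative[train_idx]  # train on in-scope categories only
            clf = make_pipeline(TfidfVectorizer(),
                                LogisticRegression(max_iter=1000))
            clf.fit(texts[train_idx][keep], labels[train_idx][keep])
            proba = clf.predict_proba(texts[test_idx])
            confident = proba.max(axis=1) >= threshold
            predicted = clf.classes_[proba.argmax(axis=1)]
            # In-scope examples must get their own label confidently;
            # out-of-scope examples must be rejected (low confidence).
            correct = np.where(
                is_negative[test_idx],
                ~confident,
                confident & (predicted == labels[test_idx]),
            )
            fold_scores.append(correct.mean())
        return float(np.mean(fold_scores))

Called as, say, nex_cv_score(texts, labels, negative_labels={"smalltalk"}) (the category name is hypothetical), this returns a mean per-fold accuracy in which out-of-scope inputs only count as correct when the classifier abstains, which is what keeps the score from being overly optimistic on unbalanced data. Note that StratifiedKFold requires every category to have at least n_splits examples.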
