Evaluation and Improvement of Chatbot Text Classification Data Quality Using Plausible Negative Examples

We describe and validate a metric for estimating multi-class classifier performance, based on cross-validation and adapted for improving the small, unbalanced natural-language datasets used in chatbot design. Our experiences draw upon building recruitment chatbots that mediate communication between job-seekers and recruiters by exposing the ML/NLP dataset to the recruiting team. Evaluation approaches must be understandable to various stakeholders and useful for improving chatbot performance. The metric, nex-cv, uses negative examples in the evaluation of text classification and fulfils three requirements. First, it is actionable: it can be used by non-developer staff. Second, it is not overly optimistic compared to human ratings, making it a fast method for comparing classifiers. Third, it allows model-agnostic comparison, making it useful for comparing systems despite implementation differences. We validate the metric on seven recruitment-domain datasets in English and German collected over the course of one year.
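
To make the core idea concrete, the sketch below shows a simplified cross-validation loop in the spirit of nex-cv, built on scikit-learn. It is an illustration under stated assumptions, not the paper's implementation: the function name nex_cv_score, the choice of classifier (TF-IDF features with logistic regression), the confidence-threshold rule for rejecting out-of-scope inputs, and the idea that the caller passes in which categories serve as plausible negatives are all assumptions here; the actual metric selects and weights negative-example categories according to its own parameterized strategy.

    # A minimal sketch of a nex-cv-style evaluation loop (assumptions noted above).
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold
    from sklearn.pipeline import make_pipeline

    def nex_cv_score(texts, labels, negative_labels, n_splits=5, threshold=0.5):
        """Cross-validate while treating `negative_labels` as out-of-scope:
        their examples are excluded from training, and at test time they count
        as correct only if the classifier's confidence stays below `threshold`."""
        texts = np.asarray(texts)
        labels = np.asarray(labels)
        is_negative = np.isin(labels, list(negative_labels))
        fold_scores = []
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train_idx, test_idx in skf.split(texts, labels):
            keep = ~is_negative[train_idx]  # train on in-scope categories only
            clf = make_pipeline(TfidfVectorizer(),
                                LogisticRegression(max_iter=1000))
            clf.fit(texts[train_idx][keep], labels[train_idx][keep])
            proba = clf.predict_proba(texts[test_idx])
            confident = proba.max(axis=1) >= threshold
            predicted = clf.classes_[proba.argmax(axis=1)]
            # In-scope examples must get their own label confidently;
            # out-of-scope examples must be rejected (low confidence).
            correct = np.where(
                is_negative[test_idx],
                ~confident,
                confident & (predicted == labels[test_idx]),
            )
            fold_scores.append(correct.mean())
        return float(np.mean(fold_scores))

Called as, say, nex_cv_score(texts, labels, negative_labels={"smalltalk"}) (the category name is hypothetical), this returns a mean per-fold accuracy in which out-of-scope inputs only count as correct when the classifier abstains, which is what keeps the score from being overly optimistic on unbalanced data. Note that StratifiedKFold requires every category to have at least n_splits examples.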
