Classifier Data Quality: A Geometric Complexity Based Method for Automated Baseline And Insights Generation

Testing Machine Learning (ML) models and AI-Infused Applications (AIIAs), i.e., systems that contain ML models, is highly challenging. In addition to the challenges of testing classical software, it is acceptable and expected that statistical ML models will sometimes output incorrect results. A major challenge is determining when the level of incorrectness, e.g., model accuracy or F1 score for classifiers, is acceptable and when it is not. Beyond business requirements, which should provide a threshold, it is a best practice to require any proposed ML solution to outperform simple baseline models, such as a decision tree. We have developed complexity measures that quantify how difficult given observations are to assign to their true class label; these measures can then be used to automatically determine a baseline performance threshold. They improve on the best-practice baseline in that, at a linear computation cost, they also quantify each observation's classification complexity in an explainable form, regardless of the classifier model used. Our experiments with both numeric synthetic data and real natural language chatbot data demonstrate that the complexity measures effectively highlight data regions and observations that are likely to be misclassified.
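The abstract does not spell out the geometric complexity measures themselves, so the following is only a minimal sketch of the workflow it describes: score each observation's classification difficulty and compare a proposed model against a simple decision-tree baseline. The k-nearest-neighbor label-disagreement score used here (`knn_complexity`) is an assumed stand-in for the paper's measures, chosen because it is geometric, per-observation, and model-agnostic; all names and thresholds below are illustrative.

```python
# Illustrative sketch only; the per-observation complexity score below is an
# assumed stand-in (k-NN label disagreement), not the paper's actual measure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def knn_complexity(X, y, k=5):
    """Per-observation complexity: fraction of the k nearest neighbors
    whose label disagrees with the observation's own label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)           # idx[:, 0] is the point itself
    neighbor_labels = y[idx[:, 1:]]     # drop the self-match
    return (neighbor_labels != y[:, None]).mean(axis=1)

X, y = make_classification(n_samples=1000, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Best-practice baseline from the abstract: a simple decision tree.
baseline = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

scores = knn_complexity(X_te, y_te)
print(f"baseline accuracy: {accuracy_score(y_te, baseline.predict(X_te)):.3f}")
print(f"model accuracy:    {accuracy_score(y_te, model.predict(X_te)):.3f}")
# High-complexity observations are the likeliest misclassification candidates.
print("hardest observations:", np.argsort(scores)[-5:])
```

Like the measures described above, this scoring runs in time linear in the number of observations (given the neighbor index) and is independent of the classifier being evaluated, so the same scores can flag hard regions for any candidate model.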
