A Model-Agnostic Algorithm for Bayes Error Determination in Binary Classification

This paper presents the intrinsic limit determination algorithm (ILD algorithm), a novel technique for determining the best possible performance, measured in terms of the AUC (area under the ROC curve) and accuracy, that can be obtained from a specific dataset in a binary classification problem with categorical features, regardless of the model used. This limit, namely the Bayes error, does not depend on any model and describes an intrinsic property of the dataset. The ILD algorithm thus provides important information about the prediction limits of any binary classification algorithm applied to the dataset under consideration. The algorithm is described in detail, its complete mathematical framework is presented, and pseudocode is given to facilitate its implementation. Finally, an example with a real dataset is given.
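To illustrate why such a limit exists for categorical features, the following minimal Python sketch computes a sample-based upper bound on accuracy and AUC. It is not the ILD algorithm itself; it only assumes the standard observation that records sharing the same combination of feature values are indistinguishable to any model, so the best accuracy comes from predicting the majority class in each group, and the best AUC comes from ranking groups by their fraction of positives. The function name `empirical_limits` and the toy data are hypothetical.

```python
from collections import Counter


def empirical_limits(rows, labels):
    """Return (best_accuracy, best_auc) attainable on this sample by any
    classifier that sees only the given categorical features.

    Illustrative sketch only: an in-sample bound, not the paper's ILD algorithm.
    """
    pos, neg = Counter(), Counter()
    for features, y in zip(rows, labels):
        key = tuple(features)  # one group per distinct feature combination
        if y == 1:
            pos[key] += 1
        else:
            neg[key] += 1

    total_pos, total_neg = sum(pos.values()), sum(neg.values())
    if total_pos == 0 or total_neg == 0:
        raise ValueError("both classes must be present to define the AUC")

    groups = set(pos) | set(neg)

    # Best accuracy: predict the majority class inside every group.
    best_acc = sum(max(pos[g], neg[g]) for g in groups) / (total_pos + total_neg)

    # Best AUC: score each group by its fraction of positives, then count
    # positive/negative pairs ranked correctly (ties inside a group count 1/2).
    ordered = sorted(groups, key=lambda g: pos[g] / (pos[g] + neg[g]))
    neg_below = 0
    concordant = 0.0
    for g in ordered:
        concordant += pos[g] * (neg_below + 0.5 * neg[g])
        neg_below += neg[g]
    best_auc = concordant / (total_pos * total_neg)

    return best_acc, best_auc


# Toy data: one categorical feature with two values, six records.
rows = [("a",), ("a",), ("a",), ("b",), ("b",), ("b",)]
labels = [1, 1, 0, 0, 0, 1]
acc, auc = empirical_limits(rows, labels)
```

In the toy data, each feature value contains records of both classes, so no model can separate them perfectly, and both bounds fall below 1. Because the bound is computed on the sample itself, it should be read as an optimistic estimate when many feature combinations contain only a few records.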
