Beyond accuracy: Measures for assessing machine learning models, pitfalls and guidelines

Predictive pattern recognition models have become an important tool for analyzing neuroimaging data and for answering questions in clinical and cognitive neuroscience. Regardless of the application, the most commonly used way to quantify model performance is prediction accuracy, i.e. the proportion of correctly classified samples. Although accuracy is simple and intuitive, other performance measures are often more appropriate for many common goals of neuroimaging pattern recognition studies. In this paper, we review alternative performance measures, focusing on their interpretation and on practical aspects of model evaluation. Specifically, we consider four families of performance measures: 1) categorical performance measures such as accuracy, 2) rank-based performance measures such as the area under the receiver operating characteristic (ROC) curve, 3) probabilistic performance measures based on quadratic error such as the Brier score, and 4) probabilistic performance measures based on information criteria such as the logarithmic score. We examine their statistical properties in various settings using simulated data and real neuroimaging data derived from public datasets. Accuracy performed worst with respect to statistical power, detection of model improvement, selection of informative features, and reliability of results. Therefore, in most cases it should not be used to make statistical inferences about model performance. Accuracy should also be avoided when evaluating the utility of clinical models, because it does not take into account clinically relevant information such as the relative cost of false-positive and false-negative misclassifications or the calibration of probabilistic predictions. We recommend choosing alternative evaluation criteria according to the goals of the specific machine learning model.
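
To make the four families of measures concrete, the following is a minimal sketch in plain Python for a binary classifier that outputs the probability of the positive class. It is not the implementation used in the paper. The function names, the 0.5 decision threshold, the use of the natural logarithm, the sign convention for the logarithmic score (reported as a mean negative log-likelihood, so lower is better), and the toy labels and predictions are all assumptions made for illustration.

```python
import math


def accuracy(y_true, p_pred, threshold=0.5):
    """Categorical measure: proportion of correctly classified samples."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)


def auc(y_true, p_pred):
    """Rank-based measure: probability that a random positive sample is
    scored higher than a random negative sample (ties count as 0.5)."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, p_pred):
    """Quadratic-error measure: mean squared difference between the
    predicted probability and the 0/1 outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def log_score(y_true, p_pred, eps=1e-15):
    """Information-based measure: mean negative log-likelihood of the
    observed labels (lower is better). Probabilities are clipped to
    [eps, 1 - eps] to avoid log(0); eps is an illustrative choice."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Hypothetical toy data: two models that classify every sample correctly,
# but model B is more confident in its (correct) predictions than model A.
y = [0, 0, 1, 1]
p_model_a = [0.4, 0.4, 0.6, 0.6]
p_model_b = [0.1, 0.1, 0.9, 0.9]

for name, p in [("A", p_model_a), ("B", p_model_b)]:
    print(name, accuracy(y, p), auc(y, p), brier_score(y, p), log_score(y, p))
```

On this toy example both models reach an accuracy and an AUC of 1.0, whereas the Brier score (about 0.16 versus 0.01) and the logarithmic score both favor model B. This is one simple way in which a categorical measure can fail to register an improvement that the probabilistic measures do register.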
