Probability estimation with machine learning methods for dichotomous and multicategory outcome: Applications

We apply machine learning methods to three large datasets, each posing a probability estimation problem for dichotomous or multicategory outcomes. Specifically, we investigate k-nearest neighbors, bagged nearest neighbors, random forests for probability estimation trees, and support vector machines with Bessel, linear, Laplacian, and radial basis kernels, and we compare them with logistic regression. The dataset from the German Stroke Study Collaboration, with dichotomous and three-category outcome variables, allows in particular for temporal and external validation. The other two datasets are freely available from the UCI Machine Learning Repository and have dichotomous outcome variables. One of them, the Cleveland Clinic Foundation Heart Disease dataset, uses data from one clinic for training and from three clinics for external validation; the other, the thyroid disease dataset, allows for temporal validation by splitting the data into training and test sets by date of recruitment into the study. For dichotomous outcome variables, we use receiver operating characteristic curves, area under the curve values with bootstrapped 95% confidence intervals, and Hosmer-Lemeshow-type figures as comparison criteria. For dichotomous and multicategory outcomes, we calculate bootstrap Brier scores with 95% confidence intervals and also compare them through bootstrapping. In a supplement, we provide R code for performing the analyses and for random forest analyses in Random Jungle, version 2.1.0. The learning machines show promising performance over all constructed models. They are simple to apply and serve as an alternative approach to logistic or multinomial logistic regression analysis.
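
As an illustration of the Brier score criterion with a bootstrapped 95% confidence interval, the following is a minimal sketch for a dichotomous outcome. It is not the R code from the supplement. The function names, the percentile bootstrap, the number of resamples, and the toy data are all assumptions made for this example.

```python
import random


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def bootstrap_brier_ci(probs, outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Brier score (an assumed variant)."""
    rng = random.Random(seed)
    n = len(probs)
    scores = []
    for _ in range(n_boot):
        # Resample subjects with replacement, keeping each pair (p, y) together.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(
            brier_score([probs[i] for i in idx], [outcomes[i] for i in idx])
        )
    scores.sort()
    lower = scores[int((alpha / 2) * n_boot)]
    upper = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical predicted probabilities and observed outcomes (toy data).
probs = [0.9, 0.8, 0.3, 0.2, 0.6, 0.1, 0.7, 0.4]
outcomes = [1, 1, 0, 0, 1, 0, 0, 1]

point = brier_score(probs, outcomes)
lower, upper = bootstrap_brier_ci(probs, outcomes)
print(f"Brier score: {point:.3f}, 95% CI: [{lower:.3f}, {upper:.3f}]")
```

For a multicategory outcome, the same resampling scheme could be applied to a Brier score summed over the outcome categories. Two models could be compared by bootstrapping the difference of their scores on the same resampled subjects.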
