Probability estimation with machine learning methods for dichotomous and multicategory outcome: Theory

Probability estimation for binary and multicategory outcomes using logistic and multinomial logistic regression has a long‐standing tradition in biostatistics. However, biases may occur if the model is misspecified. In contrast, outcome probabilities for individuals can be estimated consistently with machine learning approaches, including k‐nearest neighbors (k‐NN), bagged nearest neighbors (b‐NN), random forests (RF), and support vector machines (SVM). Because machine learning methods are rarely used by applied biostatisticians, the primary goal of this paper is to explain the concept of probability estimation with these methods and to summarize recent theoretical findings. Probability estimation in k‐NN, b‐NN, and RF can be embedded into the class of nonparametric regression learning machines; therefore, we start with the construction of nonparametric regression estimates and review results on consistency and rates of convergence. In SVMs, outcome probabilities for individuals are estimated consistently by repeatedly solving classification problems. For SVMs, we first review the classification problem and then dichotomous probability estimation. Next, we extend the algorithms for estimating probabilities using k‐NN, b‐NN, and RF to multicategory outcomes and discuss approaches for the multicategory probability estimation problem using SVMs. In simulation studies for dichotomous and multicategory dependent variables, we demonstrate the general validity of the machine learning methods and compare them with logistic regression. However, each method fails in at least one simulation scenario. We conclude with a discussion of the failures and give recommendations for selecting and tuning the methods. Applications to real data and example code are provided in a companion article (doi:10.1002/bimj.201300077).
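To illustrate the idea that probability estimation with k‐NN is a nonparametric regression problem, the following minimal sketch averages the 0/1 outcomes of the k nearest training points. It is not taken from the paper or its companion article; the function names `knn_probability` and `simulate`, the toy data model P(Y = 1 | x) = x, and the choices n = 2000 and k = 200 are assumptions made only for illustration.

```python
import math
import random


def knn_probability(train_x, train_y, query, k):
    """Estimate P(Y = 1 | X = query) as the mean of the 0/1 labels
    of the k training points nearest to `query` (Euclidean distance)."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Sort training indices by distance to the query point.
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    neighbors = order[:k]
    # Averaging 0/1 labels is a local regression estimate of the
    # conditional probability, not a majority-vote classification.
    return sum(train_y[i] for i in neighbors) / k


def simulate(n, rng):
    """Hypothetical toy data: one covariate x ~ U(0, 1), P(Y = 1 | x) = x."""
    xs = [(rng.random(),) for _ in range(n)]
    ys = [1 if rng.random() < x[0] else 0 for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = simulate(2000, rng)

# Estimated individual probabilities at two covariate values.
p_low = knn_probability(train_x, train_y, (0.1,), k=200)
p_high = knn_probability(train_x, train_y, (0.9,), k=200)
print(p_low, p_high)
```

In this toy setting the two estimates should lie near the true probabilities 0.1 and 0.9. In general, k has to grow with the sample size while k/n shrinks for such an estimate to be consistent, which is why k is a tuning parameter rather than a fixed constant.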
