Making Sense of Random Forest Probabilities: A Kernel Perspective

Random forests are a popular tool for estimating class probabilities in machine learning classification tasks. However, the usual means of doing so lacks a principled justification: one simply counts the fraction of trees in the forest that vote for a given class. In this paper, we forge a connection between random forests and kernel regression, which places random forest probability estimation on sounder statistical footing. As part of our investigation, we develop a model for the proximity kernel and relate it to the geometry and sparsity of the estimation problem. We also provide intuition and recommendations for tuning a random forest to improve its probability estimates.
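
To make the two viewpoints concrete, the following minimal sketch contrasts the vote-counting estimate with a kernel-regression estimate built from the proximity kernel. It is an illustration only, not the paper's method: the leaf assignments, labels, and all function names (`leaf_majority_votes`, `vote_fraction`, `proximity`, `kernel_estimate`) are hypothetical. It assumes the proximity of two points is the share of trees in which they fall in the same leaf, that the kernel estimate takes the Nadaraya-Watson form, and that every leaf reached by the query contains at least one training point.

```python
from collections import Counter
from typing import List, Sequence

def leaf_majority_votes(query_leaves: Sequence[int],
                        train_leaves: Sequence[Sequence[int]],
                        labels: Sequence[int]) -> List[int]:
    """For each tree, return the majority label among training points
    that share the query's leaf (assumes that leaf is never empty)."""
    votes = []
    for t, q_leaf in enumerate(query_leaves):
        in_leaf = [y for row, y in zip(train_leaves, labels) if row[t] == q_leaf]
        votes.append(Counter(in_leaf).most_common(1)[0][0])
    return votes

def vote_fraction(tree_votes: Sequence[int], cls: int) -> float:
    """Vote-counting estimate: fraction of trees that vote for `cls`."""
    return sum(1 for v in tree_votes if v == cls) / len(tree_votes)

def proximity(query_leaves: Sequence[int],
              train_leaves: Sequence[Sequence[int]]) -> List[float]:
    """Proximity kernel: for each training point, the share of trees in
    which it lands in the same leaf as the query."""
    n_trees = len(query_leaves)
    return [sum(1 for q, l in zip(query_leaves, row) if q == l) / n_trees
            for row in train_leaves]

def kernel_estimate(weights: Sequence[float],
                    labels: Sequence[int], cls: int) -> float:
    """Nadaraya-Watson estimate of P(y = cls | x) with the given weights."""
    total = sum(weights)
    if total == 0:
        raise ValueError("query shares no leaf with any training point")
    return sum(w for w, y in zip(weights, labels) if y == cls) / total

# Hypothetical toy forest: 3 trees, 4 training points.
# train_leaves[i][t] is the leaf id of training point i in tree t.
train_leaves = [[0, 1, 0], [0, 0, 0], [1, 1, 1], [1, 1, 0]]
labels = [1, 1, 0, 0]
query_leaves = [0, 1, 0]  # leaf id of the query point in each tree

votes = leaf_majority_votes(query_leaves, train_leaves, labels)
p_vote = vote_fraction(votes, cls=1)
p_kernel = kernel_estimate(proximity(query_leaves, train_leaves), labels, cls=1)
print(f"vote fraction: {p_vote:.3f}, kernel estimate: {p_kernel:.3f}")
```

On this toy data the two estimates differ, which illustrates that vote counting and proximity-weighted averaging are related but not identical summaries of the same forest.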
