Asymptotic Theory for Random Forests

Random forests have proven to be reliable predictive algorithms in many application areas. Not much is known, however, about the statistical properties of random forests. Several authors have established conditions under which their predictions are consistent, but these results do not provide practical estimates of random forest errors. In this paper, we analyze a random forest model based on subsampling and show that random forest predictions are asymptotically normal, provided that the subsample size s scales as s(n)/n = o(log(n)^{-d}), where n is the number of training examples and d is the number of features. Moreover, we show that the asymptotic variance can be consistently estimated using the infinitesimal jackknife for bagged ensembles recently proposed by Efron (2014). In other words, our results let us both characterize and estimate the error distribution of random forest predictions, taking a step toward making random forests a tool for statistical inference rather than just a black-box predictive algorithm.
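To make the variance estimator discussed above concrete, the following sketch computes Efron's (2014) infinitesimal jackknife estimate, V_IJ = sum_i Cov(N_i, T)^2, where N_i counts how often training point i appears in an ensemble member's subsample and T is that member's prediction. The helper name and the toy ensemble (subsample means standing in for trees) are illustrative assumptions, not the paper's code.

```python
import numpy as np

def infinitesimal_jackknife_variance(inclusion_counts, predictions):
    """Infinitesimal jackknife variance estimate for a subsampled ensemble.

    inclusion_counts: (B, n) array; entry (b, i) counts how often training
                      point i appears in the b-th subsample.
    predictions:      (B,) array; prediction of each ensemble member.
    Returns sum_i Cov(N_i, T)^2, the plug-in IJ variance estimate.
    """
    B, _ = inclusion_counts.shape
    centered_counts = inclusion_counts - inclusion_counts.mean(axis=0)
    centered_preds = predictions - predictions.mean()
    # Monte Carlo covariance Cov(N_i, T) for each training point i
    cov = centered_counts.T @ centered_preds / B
    return float(np.sum(cov ** 2))

# Toy illustration: ensemble of subsample means (a stand-in for trees),
# built by subsampling without replacement as in the paper's setup.
rng = np.random.default_rng(0)
n, B, s = 200, 2000, 20            # training size, ensemble size, subsample size
y = rng.normal(size=n)
counts = np.zeros((B, n))
preds = np.empty(B)
for b in range(B):
    idx = rng.choice(n, size=s, replace=False)
    counts[b, idx] = 1.0
    preds[b] = y[idx].mean()

v_ij = infinitesimal_jackknife_variance(counts, preds)
```

For real forests, `predictions` would hold each tree's prediction at a fixed test point; with finite B the raw estimate carries upward Monte Carlo bias, which a bias correction (as in Wager, Hastie, and Efron's work) can remove.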

[1] W. Hoeffding, A Class of Statistics with Asymptotically Normal Distribution, 1948.

[2] J. Hájek, et al., Asymptotic Normality of Simple Linear Rank Statistics Under Alternatives II, 1968.

[3] D. Rubinfeld, et al., Hedonic housing prices and the demand for clean air, 1978.

[4] B. Efron, et al., The Jackknife Estimate of Variance, 1981.

[5] P. Bühlmann, et al., Analyzing Bagging, 2001.

[6] P. Hall, et al., Effects of bagging and bias correction on estimators defined by estimating equations, 2003.

[7] E. R. Ziegel, et al., The Elements of Statistical Learning, Technometrics, 2003.

[8] L. Breiman, et al., Random Forests, Machine Learning, 2001.

[9] L. Breiman, Consistency for a Simple Model of Random Forests, 2004.

[10] L. Breiman, et al., Bagging Predictors, Machine Learning, 1996.

[11] P. Hall, et al., Properties of bagged nearest neighbour classifiers, 2005.

[12] A. Zeileis, et al., Bias in random forest variable importance measures: Illustrations, sources and a solution, BMC Bioinformatics, 2007.

[13] Y. Lin, et al., Random Forests and Adaptive Nearest Neighbors, 2006.

[14] N. Meinshausen, Quantile Regression Forests, Journal of Machine Learning Research, 2006.

[15] A. Buja, et al., Observations on Bagging, 2006.

[16] J. Friedman, et al., On bagging and nonlinear estimation, 2007.

[17] P. Cortez, et al., A data mining approach to predict forest fires using meteorological data, 2007.

[18] A. Liaw, et al., Classification and Regression by randomForest, 2007.

[19] L. Devroye, et al., Consistency of Random Forests and Other Averaging Classifiers, Journal of Machine Learning Research, 2008.

[20] J. Sexton, et al., Standard errors for bagged and random forest estimators, Computational Statistics & Data Analysis, 2009.

[21] J. Duan, Bootstrap-Based Variance Estimators for a Bagging Predictor, 2011.

[22] W.-Y. Loh, Classification and regression trees, WIREs Data Mining and Knowledge Discovery, 2011.

[23] G. Biau, Analysis of a Random Forests Model, Journal of Machine Learning Research, 2010.

[24] J. Norris, Appendix: Probability and Measure, 1997.

[25] B. Efron, Estimation and Accuracy After Model Selection, Journal of the American Statistical Association, 2014.

[26] M. Denil, et al., Narrowing the Gap: Random Forests In Theory and In Practice, ICML, 2013.

[27] T. J. Hastie, et al., Confidence intervals for random forests: the jackknife and the infinitesimal jackknife, Journal of Machine Learning Research, 2013.

[28] G. Hooker, et al., Ensemble Trees and CLTs: Statistical Inference for Supervised Learning, 2014.

[29] S. Wager, Uniform Convergence of Random Forests via Adaptive Concentration, 2015.