Complete Analysis of a Random Forest Model

Random forests have become an important tool for improving accuracy in regression problems since their popularization by Breiman (2001) and others. In this paper, we revisit a random forest model originally proposed by Breiman (2004) and later studied by Biau (2012), in which a feature is selected at random at each node and the split occurs at the midpoint of the cell along that feature. If the regression function is sparse and depends only on a small, unknown subset of $S$ out of $d$ features, we show that, given $n$ observations, this random forest model outputs a predictor whose mean-squared prediction error is of order $\left(n\sqrt{\log^{S-1} n}\right)^{-\frac{1}{S\log 2+1}}$. When $S \leq \lfloor 0.72\, d \rfloor$, this rate is better than the minimax optimal rate $n^{-\frac{2}{d+2}}$ for the class of $d$-dimensional Lipschitz functions. As a consequence of our analysis, we show that the variance of the forest decays with the depth of the trees at a rate that is independent of the ambient dimension, even when the trees are fully grown. In particular, if $\ell_{\mathrm{avg}}$ (resp. $\ell_{\max}$) denotes the average (resp. maximum) number of observations per leaf node, we show that the variance of this forest is $\Theta\left(\ell_{\mathrm{avg}}^{-1}(\sqrt{\log n})^{-(S-1)}\right)$, which, in the case $S = d$, is similar in form to the lower bound $\Omega\left(\ell_{\max}^{-1}(\log n)^{-(d-1)}\right)$ of Lin and Jeon (2006) for any random forest model with a nonadaptive splitting scheme. We also show that the bias bound is tight for any linear model with a nonzero parameter vector. Together, these results completely characterize the fundamental limits of this random forest model. Our analysis also implies that better theoretical performance can be achieved if the trees are grown less aggressively (i.e., to a shallower depth) than previous work would otherwise recommend.
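To make the splitting scheme concrete, the sketch below grows such a forest on the unit cube: at each internal node a coordinate is drawn uniformly at random and the cell is split at its midpoint along that coordinate, with each leaf predicting the mean response of the observations it contains. This is a minimal illustrative sketch in Python/NumPy, not the paper's implementation; the function names, the toy sparse regression function (depending on $S = 2$ of $d = 10$ features), and parameters such as the depth and number of trees are assumptions chosen purely for illustration.

```python
import numpy as np

def grow_centered_tree(X, y, cell, depth, rng):
    """Grow one 'centered' tree on [0,1]^d: at each node a coordinate is chosen
    uniformly at random and the cell is split at its midpoint along that coordinate."""
    if depth == 0 or len(y) <= 1:
        # Leaf: predict the mean response of the observations in the cell
        # (0.0 is a fallback for empty cells in this toy sketch).
        return {"leaf": True, "value": y.mean() if len(y) else 0.0}
    j = rng.integers(cell.shape[0])            # split coordinate, chosen at random
    mid = 0.5 * (cell[j, 0] + cell[j, 1])      # midpoint of the cell's j-th side
    left_mask = X[:, j] <= mid
    left_cell, right_cell = cell.copy(), cell.copy()
    left_cell[j, 1] = mid
    right_cell[j, 0] = mid
    return {
        "leaf": False, "coord": j, "threshold": mid,
        "left": grow_centered_tree(X[left_mask], y[left_mask], left_cell, depth - 1, rng),
        "right": grow_centered_tree(X[~left_mask], y[~left_mask], right_cell, depth - 1, rng),
    }

def predict_tree(node, x):
    while not node["leaf"]:
        node = node["left"] if x[node["coord"]] <= node["threshold"] else node["right"]
    return node["value"]

def predict_forest(trees, x):
    # The forest prediction is the average of the individual tree predictions.
    return np.mean([predict_tree(t, x) for t in trees])

# Toy example: n observations of a regression function that depends on only 2 of d features.
rng = np.random.default_rng(0)
n, d, depth, n_trees = 2000, 10, 8, 100
X = rng.random((n, d))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] + 0.1 * rng.standard_normal(n)
unit_cube = np.column_stack([np.zeros(d), np.ones(d)])
trees = [grow_centered_tree(X, y, unit_cube.copy(), depth, rng) for _ in range(n_trees)]
print(predict_forest(trees, np.full(d, 0.5)))
```

Growing the trees to a fixed depth rather than until leaves are pure mirrors the abstract's observation that shallower trees can yield better theoretical performance for this model.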

[1] Wei-Yin Loh, et al. Classification and regression trees, 2011, WIREs Data Mining Knowl. Discov.

[2] Stefan Wager. Asymptotic Theory for Random Forests, 2014, arXiv:1405.0352.

[3] Pierre Geurts, et al. Extremely randomized trees, 2006, Machine Learning.

[4] Antoni Zygmund, et al. Note on the differentiability of multiple integrals, 1934.

[5] Yali Amit, et al. Shape Quantization and Recognition with Randomized Trees, 1997, Neural Computation.

[6] Erwan Scornet, et al. Impact of subsampling and pruning on random forests, 2016, arXiv:1603.04261.

[7] Volker Scheidemann. Introduction to Complex Analysis in Several Variables, 2005, Compact Textbooks in Mathematics.

[8] Robin Genuer, et al. Variance reduction in purely random forests, 2012.

[9] Sylvain Arlot, et al. Analysis of purely random forests bias, 2014, arXiv.

[10] Gérard Biau, et al. Analysis of a Random Forests Model, 2010, J. Mach. Learn. Res.

[11] Luc Devroye, et al. Consistency of Random Forests and Other Averaging Classifiers, 2008, J. Mach. Learn. Res.

[12] Erwan Scornet, et al. Minimax optimal rates for Mondrian trees and forests, 2018, The Annals of Statistics.

[13] L. Breiman. Consistency for a Simple Model of Random Forests, 2004.

[14] Stefan Wager, et al. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests, 2015, Journal of the American Statistical Association.

[15] David Mease, et al. Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers, 2015, J. Mach. Learn. Res.

[16] Misha Denil, et al. Narrowing the Gap: Random Forests In Theory and In Practice, 2013, ICML.

[17] Jean-Philippe Vert, et al. Consistency of Random Forests, 2014, arXiv:1405.2881.

[18] Yi Lin, et al. Random Forests and Adaptive Nearest Neighbors, 2006.

[19] Yuhong Yang, et al. Information-theoretic determination of minimax rates of convergence, 1999.

[20] L. Breiman. Some Infinity Theory for Predictor Ensembles, 2000.

[21] Robin Genuer, et al. Risk bounds for purely uniformly random forests, 2010, arXiv:1006.2980.

[22] Erwan Scornet, et al. Random Forests and Kernel Methods, 2015, IEEE Transactions on Information Theory.

[23] Thomas G. Dietterich. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization, 2000, Machine Learning.

[24] Stefan Wager, et al. Adaptive Concentration of Regression Trees, with Application to Random Forests, 2015.