The effect of splitting on random forests

The effect of a splitting rule on random forests (RF) is systematically studied for regression and classification problems. A class of weighted splitting rules, which includes CART weighted variance splitting and Gini index splitting as special cases, is studied in detail and shown to possess a unique adaptive property with respect to signal and noise. We show that for noisy variables, weighted splitting favors end-cut splits. While end-cut splits have traditionally been viewed as undesirable for single trees, we argue that for deeply grown trees (a trademark of RF), end-cut splitting is useful because: (a) it maximizes the sample size, making it possible for a tree to recover from a bad split, and (b) if a branch repeatedly splits on noise, the tree's minimal node size is reached, which promotes termination of the bad branch. For strong variables, weighted variance splitting is shown to possess the desirable property of splitting at points of curvature of the underlying target function. This adaptivity to both noise and signal does not hold for unweighted and heavy weighted splitting rules. These latter rules are either too greedy, making them poor at recognizing noisy scenarios, or overly aggressive in their end-cut preference (ECP), making them poor at recognizing signal. These results also shed light on pure random splitting and show that such rules are the least effective. On the other hand, because randomized rules are attractive for their computational efficiency, we introduce a hybrid method employing random split-point selection that retains the adaptive property of weighted splitting rules while remaining computationally efficient.
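To make the splitting rules concrete, the sketch below (an editorial illustration, not code from the paper) shows a CART-style weighted variance splitting criterion for regression together with a randomized variant that evaluates the same criterion over only a few randomly drawn split points, in the spirit of the hybrid random split-point selection described above. All function and variable names (weighted_variance_impurity, best_split_exhaustive, best_split_random_points) are illustrative assumptions.

# Illustrative sketch: CART weighted variance splitting for one candidate
# variable, plus a cheaper randomized variant over a few random split points.
import numpy as np

def weighted_variance_impurity(y_left, y_right):
    # Weighted within-node variance after a split (lower is better).
    n_left, n_right = len(y_left), len(y_right)
    n = n_left + n_right
    return (n_left / n) * np.var(y_left) + (n_right / n) * np.var(y_right)

def best_split_exhaustive(x, y):
    # CART-style weighted variance splitting: scan every midpoint of x.
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_point, best_impurity = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no valid split point between tied values
        split = 0.5 * (x_sorted[i] + x_sorted[i - 1])
        imp = weighted_variance_impurity(y_sorted[:i], y_sorted[i:])
        if imp < best_impurity:
            best_point, best_impurity = split, imp
    return best_point, best_impurity

def best_split_random_points(x, y, n_candidates=10, seed=None):
    # Hybrid idea: optimize the same weighted criterion, but only over a
    # small set of randomly chosen split points (cheaper than a full scan).
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(x.min(), x.max(), size=n_candidates)
    best_point, best_impurity = None, np.inf
    for split in candidates:
        left, right = y[x <= split], y[x > split]
        if len(left) == 0 or len(right) == 0:
            continue  # skip degenerate splits with an empty daughter node
        imp = weighted_variance_impurity(left, right)
        if imp < best_impurity:
            best_point, best_impurity = split, imp
    return best_point, best_impurity

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 200)
    y = np.where(x > 0.5, 1.0, 0.0) + rng.normal(scale=0.1, size=200)  # step at x = 0.5
    print(best_split_exhaustive(x, y))          # split lands near the step, a point of curvature
    print(best_split_random_points(x, y, 10))   # cheaper randomized approximation

On a strong variable such as the step function above, both versions concentrate splits near the change point; on a pure-noise variable, the weighted criterion tends to favor end-cut splits, which is the behavior the paper analyzes.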
