A finite-sample simulation study of cross validation in tree-based models

Cross validation (CV) is widely used for choosing and evaluating statistical models. The main purpose of this study is to explore the behavior of CV in tree-based models. We take an experimental approach, comparing a cross-validated tree classifier with the Bayes classifier, which is optimal for the underlying distribution. The main observation of this study is that the difference between the testing and training errors of a cross-validated tree classifier and that of the Bayes classifier empirically follow a linear regression relationship. The slope and the coefficient of determination of this regression model can serve as performance measures of a cross-validated tree classifier. Moreover, the simulations reveal that the performance of a cross-validated tree classifier depends on the geometry and parameters of the underlying distribution and on the sample size. Our study helps explain, evaluate, and justify the use of CV in tree-based models when the sample size is relatively small.
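The experimental approach described above can be sketched in miniature. The snippet below is an illustrative toy version, not the paper's actual simulation design: it assumes a 1-D two-class Gaussian problem (class 0 from N(0,1), class 1 from N(2,1), equal priors), uses a decision stump (a depth-1 tree, with a depth-0 majority rule as the CV alternative) in place of a full tree learner, and then regresses the test-minus-training error gap of the CV-selected classifier on that of the Bayes rule to obtain the slope and coefficient of determination used as performance measures.

```python
import random

random.seed(1)

def draw_sample(n):
    """Half the points from class 0 ~ N(0,1), half from class 1 ~ N(2,1)."""
    data = [(random.gauss(0.0, 1.0), 0) for _ in range(n // 2)]
    data += [(random.gauss(2.0, 1.0), 1) for _ in range(n - n // 2)]
    random.shuffle(data)
    return data

def error_rate(predict, data):
    """Misclassification rate of a classifier on a labeled sample."""
    return sum(1 for x, y in data if predict(x) != y) / len(data)

def fit_stump(train):
    """Depth-1 tree: choose the split point minimizing training error."""
    xs = sorted(x for x, _ in train)
    cuts = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    best = min(cuts, key=lambda t: error_rate(lambda x: int(x > t), train))
    return lambda x: int(x > best)

def fit_majority(train):
    """Depth-0 tree: always predict the majority training label."""
    label = int(2 * sum(y for _, y in train) >= len(train))
    return lambda x: label

def cv_classifier(train, k=5):
    """Choose depth 0 vs. depth 1 by k-fold CV, then refit on all of train."""
    folds = [train[i::k] for i in range(k)]
    def cv_err(fit):
        total = 0.0
        for i in range(k):
            rest = [p for j, f in enumerate(folds) if j != i for p in f]
            total += error_rate(fit(rest), folds[i])
        return total / k
    return min((fit_majority, fit_stump), key=cv_err)(train)

bayes = lambda x: int(x > 1.0)  # optimal rule for these two Gaussians

# One replication: record the test-minus-training error gap of each classifier.
gaps_cv, gaps_bayes = [], []
for _ in range(100):
    train, test = draw_sample(50), draw_sample(2000)
    clf = cv_classifier(train)
    gaps_cv.append(error_rate(clf, test) - error_rate(clf, train))
    gaps_bayes.append(error_rate(bayes, test) - error_rate(bayes, train))

# Least-squares fit of the CV gaps on the Bayes gaps: the slope and R^2
# play the role of the performance measures discussed in the abstract.
n = len(gaps_cv)
mx, my = sum(gaps_bayes) / n, sum(gaps_cv) / n
sxx = sum((x - mx) ** 2 for x in gaps_bayes)
syy = sum((y - my) ** 2 for y in gaps_cv)
sxy = sum((x - mx) * (y - my) for x, y in zip(gaps_bayes, gaps_cv))
slope = sxy / sxx
r2 = sxy * sxy / (sxx * syy)
print(f"slope = {slope:.3f}, R^2 = {r2:.3f}")
```

A slope near 1 with a high R² would indicate that the CV tree's generalization gap tracks the Bayes classifier's sampling fluctuation closely; how both quantities move with the class separation and the training-set size is the kind of dependence the full simulation study examines.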
