Variable Selection Issues in Tree-Based Regression Models

Interest in classification and regression tree (CART) analysis has been increasing. A tree-based regression model is constructed by recursively partitioning the data, choosing at each step the split that yields the maximum reduction in the variability of the response. Unfortunately, the exhaustive search over candidate splits can bias variable selection: it tends to choose as a splitter a categorical variable that has many distinct values. In this study, the generalized unbiased interaction detection and estimation (GUIDE) model, an unbiased tree-based regression method, is introduced for its robustness against this variable selection bias. The theoretical differences between CART and GUIDE in variable selection are presented, and the outcomes of the two tree-based regression models are compared and analyzed with intersection inventory and crash data. The results underscore GUIDE's strength in selecting variables without favoring those with many distinct values. A simulation sheds additional light on the negative impact of applying an algorithm inappropriately to the data. This paper concludes by addressing the strengths and weaknesses of the two hierarchical tree-based regression models, CART and GUIDE, and, more importantly, the differences between them, and advises on their appropriate application. It is anticipated that the GUIDE model will provide a new perspective for users of tree-based models and will offer an advantage over existing methods. Users in transportation should choose the method appropriate to their data and use it to their advantage.
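The selection bias described above can be illustrated with a small simulation. The following is a minimal sketch, not the simulation reported in the paper: the response is pure noise, so neither predictor deserves to be chosen, yet an exhaustive search for the largest reduction in the residual sum of squares will usually favor the categorical predictor with more levels. The function names, sample size, and numbers of categories (2 versus 8) are assumptions made for illustration.

```python
import random
from itertools import combinations


def best_sse_reduction(x, y):
    """Largest reduction in the residual sum of squares over all binary
    partitions of the categories of x (exhaustive search)."""
    # Per-category sufficient statistics: (count, sum of responses).
    stats = {}
    for xi, yi in zip(x, y):
        n, s = stats.get(xi, (0, 0.0))
        stats[xi] = (n + 1, s + yi)
    cats = sorted(stats)
    n_tot, s_tot = len(y), sum(y)
    best = 0.0
    # Enumerate every non-empty proper subset of categories as the left child.
    for r in range(1, len(cats)):
        for left in combinations(cats, r):
            n_l = sum(stats[c][0] for c in left)
            s_l = sum(stats[c][1] for c in left)
            n_r, s_r = n_tot - n_l, s_tot - s_l
            # SSE(parent) - SSE(left) - SSE(right), written with group sums.
            gain = s_l ** 2 / n_l + s_r ** 2 / n_r - s_tot ** 2 / n_tot
            best = max(best, gain)
    return best


def selection_frequency(n_obs=100, n_reps=300, k_small=2, k_large=8, seed=0):
    """Fraction of replications in which the many-level predictor wins,
    although the response is independent of both predictors."""
    rng = random.Random(seed)
    wins_large = 0
    for _ in range(n_reps):
        y = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        x_small = [rng.randrange(k_small) for _ in range(n_obs)]
        x_large = [rng.randrange(k_large) for _ in range(n_obs)]
        if best_sse_reduction(x_large, y) > best_sse_reduction(x_small, y):
            wins_large += 1
    return wins_large / n_reps


if __name__ == "__main__":
    print("share of splits on the 8-level predictor:", selection_frequency())
```

An unbiased selector would pick each predictor about half the time in this setting; a share well above one half reflects only the larger number of candidate splits offered by the 8-level predictor.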
