An alternative pruning-based approach to unbiased recursive partitioning

Tree-based methods are a non-parametric modelling strategy that can be used in combination with generalized linear models or Cox proportional hazards models, mostly at an exploratory stage. Their popularity stems mainly from the simplicity of the technique and the ease with which the resulting model can be interpreted. Variable selection bias towards variables with many possible splits or missing values has been identified as one of the problems associated with tree-based methods. A number of unbiased recursive partitioning algorithms have been proposed that avoid this bias by using p-values in the splitting procedure of the algorithm. The final tree is obtained either by applying direct stopping rules (pre-pruning) or by growing a large tree first and pruning it afterwards (post-pruning). Some of the drawbacks of pre-pruned trees based on p-values in the presence of interaction effects and a large number of explanatory variables are discussed, and a simple alternative post-pruning solution is presented that allows such interactions to be identified. The proposed method includes a novel pruning algorithm that uses a false discovery rate (FDR) controlling procedure to determine which splits correspond to significant tests. The new approach is demonstrated on simulated and real-life examples.
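The FDR-controlling step the abstract refers to is typically the Benjamini-Hochberg procedure applied to the p-values of the split tests. The sketch below is illustrative only, not the paper's implementation: the function name and the example p-values are assumptions, and the procedure shown is the standard Benjamini-Hochberg step-up rule.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask marking which p-values are significant
    under the Benjamini-Hochberg FDR-controlling procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)          # indices that sort the p-values ascending
    ranked = p[order]
    # Step-up rule: find the largest k with p_(k) <= (k/m) * alpha,
    # then reject all hypotheses with the k smallest p-values.
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = ranked <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        significant[order[: k + 1]] = True
    return significant

# Hypothetical p-values from the split tests of a fitted tree:
split_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.32, 0.9]
print(benjamini_hochberg(split_p))
# -> [ True  True False False False False False False]
```

In a post-pruning context, splits whose tests fall outside the significant set would be candidates for removal; the order in which the tree is collapsed is determined by the pruning algorithm itself.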
