Tree ensembles with rule structured horseshoe regularization

We propose a new Bayesian model for flexible nonlinear regression and classification using tree ensembles. The model is based on the RuleFit approach of Friedman and Popescu (2008), in which decision rules extracted from trees and linear terms are used as predictors in an L1-regularized regression. We modify RuleFit by replacing the L1 regularization with a horseshoe prior, which is well known to shrink noise predictors aggressively while leaving the important signal essentially untouched. This is especially important when a large number of rules are used as predictors, since many of them contribute only noise. Our horseshoe prior has an additional hierarchical layer that applies more shrinkage a priori to rules with a large number of splits, and to rules that are satisfied by only a few observations. The aggressive noise shrinkage of our prior also makes it possible to complement the rules from boosting in Friedman and Popescu (2008) with an additional set of trees from a random forest, which brings desirable diversity to the ensemble. We sample from the posterior distribution using a very efficient and easily implemented Gibbs sampler. The new model is shown to outperform state-of-the-art methods such as RuleFit, BART, and random forests on 16 datasets. The model and its interpretation are demonstrated on the well-known Boston housing data and on gene expression data for cancer classification. Posterior sampling, prediction, and graphical tools for interpreting the model results are implemented in a publicly available R package.
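
To make the prior structure concrete, the following display is a minimal sketch of how such a rule-structured horseshoe hierarchy can be written. The notation and the particular form of the rule-dependent scale $A_j$ are illustrative assumptions based on the description above, not the exact specification from the paper:

\[
y_i = \alpha + \sum_{j=1}^{m} \beta_j\, r_j(x_i) + \sum_{k=1}^{p} \gamma_k x_{ik} + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2),
\]
\[
\beta_j \mid \lambda_j, \tau \sim \mathcal{N}(0, \lambda_j^2 \tau^2),
\qquad \lambda_j \sim C^{+}(0, A_j),
\qquad \tau \sim C^{+}(0, 1),
\]
where $r_j(x) \in \{0,1\}$ indicates whether observation $x$ satisfies rule $j$, and the rule-specific scale $A_j$ decreases with the number of splits in rule $j$ and with how few observations satisfy it, for instance $A_j = \bigl(2\min(s_j, 1-s_j)\bigr)^{\mu} / \ell_j^{\eta}$ with rule support $s_j$, rule length $\ell_j$, and hyperparameters $\mu, \eta \ge 0$, so that complex or rarely satisfied rules are shrunk more aggressively a priori. Under the usual scale-mixture representation of the half-Cauchy distributions (as in the sampler of reference [6]), the full conditional posteriors are standard Gaussian and inverse-gamma distributions, which is what makes the Gibbs sampler easy to implement.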

[1] J. H. Friedman and B. E. Popescu. Predictive Learning via Rule Ensembles. 2008, arXiv:0811.1679.

[2] Y. D. He, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 2002.

[3] A. Danchin, et al. Classification between normal and tumor tissues based on the pair-wise gene expression ratio. BMC Cancer, 2004.

[4] H. Chipman, et al. Bayesian Additive Regression Trees. 2006.

[5] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. 1996.

[6] E. Makalic, et al. A Simple Sampler for the Horseshoe Estimator. IEEE Signal Processing Letters, 2015.

[7] C. Carvalho, et al. Decoupling Shrinkage and Selection in Bayesian Linear Models: A Posterior Summary Perspective. 2014, arXiv:1408.0464.

[8] L. Breiman. Random Forests. Machine Learning, 2001.

[9] W. Yao, et al. Fully Bayesian logistic regression with hyper-LASSO priors for high-dimensional feature selection. Journal of Statistical Computation and Simulation, 2014.

[10] W. W. Cohen. Fast Effective Rule Induction. ICML, 1995.

[11] L. Breiman. Stacked regressions. Machine Learning, 2004.

[12] E. George, et al. Journal of the American Statistical Association, 2007.

[13] A. Vehtari, et al. Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 2015.

[14] J. M. Garibaldi, et al. Learning Pathway-based Decision Rules to Classify Microarray Cancer Samples. GCB, 2010.

[15] M. I. Jordan, et al. Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces. Journal of Machine Learning Research, 2004.

[16] E. Lander, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 2002.

[17] B. E. Popescu, et al. Importance Sampled Learning Ensembles. 2003.

[18] R. E. Schapire. A Brief Introduction to Boosting. IJCAI, 1999.

[19] H. Zou, et al. Regularization and variable selection via the elastic net. 2005.

[20] R. Kohn, et al. Nonparametric regression using Bayesian variable selection. 1996.

[21] D. Slonim. From patterns to pathways: gene expression data analysis comes of age. Nature Genetics, 2002.

[22] J. G. Scott, et al. The horseshoe estimator for sparse signals. 2010.

[23] D. Draper, et al. GPU-accelerated Gibbs sampling: a case study of the Horseshoe Probit model. Statistics and Computing, 2016.

[24] J. G. Scott, et al. Handling Sparsity via the Horseshoe. AISTATS, 2009.

[25] J. Fürnkranz. Separate-and-Conquer Rule Learning. Artificial Intelligence Review, 1999.

[26] Y. Freund, et al. Experiments with a New Boosting Algorithm. ICML, 1996.

[27] J. G. Scott, et al. Bayesian Inference for Logistic Models Using Pólya–Gamma Latent Variables. 2012, arXiv:1205.0310.

[28] J. Friedman. Greedy function approximation: A gradient boosting machine. 2001.