metboost: Exploratory regression analysis with hierarchically clustered data

As data collections become larger, exploratory regression analysis becomes more important but more challenging. When observations are hierarchically clustered, the problem is even harder because model selection with mixed-effects models can produce misleading results when nonlinear effects are not included in the model (Bauer and Cai, 2009). Boosted decision trees (Friedman, 2001), a machine learning method, are well suited to exploratory regression analysis in real data sets because they can detect predictors with nonlinear and interaction effects while also accounting for missing data. We propose metboost, an extension of boosted decision trees for hierarchically clustered data. It works by constraining the structure of each tree to be the same across groups, but allowing the terminal node means to differ. This allows predictors and split points to lead to different predictions within each group and approximates nonlinear group-specific effects. Importantly, metboost remains computationally feasible for thousands of observations and hundreds of predictors that may contain missing values. We apply the method to predict the math performance of 15,240 students from 751 schools in data collected for the Education Longitudinal Study of 2002 (Ingels et al., 2007), allowing 76 predictors to have unique effects for each school. Compared with boosted decision trees, metboost improves prediction performance by 15%. Results of a large simulation study show that, when group sizes are small, metboost improves variable selection performance by up to 70% and prediction performance by up to 30% compared with boosted decision trees.
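To make the core idea concrete, the sketch below shows one way the shared-structure, group-specific-means scheme described above could sit inside a gradient boosting loop: each iteration fits a single tree to the pooled residuals, then re-estimates that tree's terminal-node means within each group. This is a minimal illustration under stated assumptions, not the authors' implementation; in particular, the shrinkage of group means toward the pooled node mean, the weight k, and all function names are ours.

```python
# Illustrative sketch (assumed scheme, not the authors' code): shared tree
# structure across groups, group-specific terminal-node means with shrinkage.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def metboost_sketch(X, y, groups, n_trees=100, learning_rate=0.1,
                    max_depth=3, k=5.0):
    """X: (n, p) array; y: (n,) array; groups: (n,) array of group labels.
    Returns a list of (tree, {(leaf_id, group): mean}) boosting stages."""
    resid = y.astype(float).copy()
    stages = []
    for _ in range(n_trees):
        # 1. Shared structure: a single tree fit to everyone's residuals.
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, resid)
        leaves = tree.apply(X)              # terminal-node id per observation
        # 2. Group-specific terminal-node means, shrunk toward the pooled
        #    node mean so small groups borrow strength (assumed weight k).
        node_means = {}
        for leaf in np.unique(leaves):
            in_leaf = leaves == leaf
            pooled = resid[in_leaf].mean()  # overall mean in this node
            for g in np.unique(groups[in_leaf]):
                idx = in_leaf & (groups == g)
                n_g = idx.sum()
                node_means[(leaf, g)] = (n_g * resid[idx].mean()
                                         + k * pooled) / (n_g + k)
        # 3. Standard gradient-boosting update, but with predictions that
        #    differ by group even though the tree structure is shared.
        step = np.array([node_means[(l, g)] for l, g in zip(leaves, groups)])
        resid -= learning_rate * step
        stages.append((tree, node_means))
    return stages
```

At prediction time one would route each new observation down every stored tree and sum learning_rate times the matching group's node mean, falling back to the pooled node mean for group-node combinations unseen in training.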

[1] Alan Edelman et al., Julia: A Fresh Approach to Numerical Computing, 2014, SIAM Review.

[2] Hao Helen Zhang et al., Variable Selection for Semiparametric Mixed Models in Longitudinal Studies, 2010, Biometrics.

[3] Liang Li et al., Boosted multivariate trees for longitudinal data, 2016, Machine Learning.

[4] Ulman Lindenberger et al., Theory-guided exploration with structural equation model forests, 2016, Psychological Methods.

[5] D. Harville, Maximum Likelihood Approaches to Variance Component Estimation and to Related Problems, 1977.

[6] Trevor Hastie et al., The Elements of Statistical Learning, 2001.

[7] Peter Bühlmann et al., Boosting Algorithms: Regularization, Prediction and Model Fitting, 2007, arXiv:0804.2752.

[8] J. Friedman, Greedy function approximation: A gradient boosting machine, 2001.

[9] D. Bates et al., Fitting Linear Mixed-Effects Models Using lme4, 2014, arXiv:1406.5823.

[10] S. R. Searle et al., The Matrix Handling of BLUE and BLUP in the Mixed Linear Model, 1997.

[11] P. Bühlmann et al., Boosting with the L2-loss: regression and classification, 2001.

[12] Jerome H. Friedman et al., Multiple additive regression trees with application in epidemiology, 2003, Statistics in Medicine.

[13] D. Bates et al., Mixed-Effects Models in S and S-PLUS, 2001.

[14] Emil Pitkin et al., Peeking Inside the Black Box: Visualizing Statistical Learning With Plots of Individual Conditional Expectation, 2013, arXiv:1309.6392.

[15] G. Tutz et al., An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests, 2009, Psychological Methods.

[16] Denis Larocque et al., Mixed-effects random forest for clustered data, 2014.

[17] J. Friedman, Stochastic gradient boosting, 2002.

[18] S. Wood, Generalized Additive Models: An Introduction with R, 2006.

[19] Li Cai et al., Consequences of Unmodeled Nonlinear Effects in Multilevel Models, 2009.

[20] K. Hornik et al., Unbiased Recursive Partitioning: A Conditional Inference Framework, 2006.

[21] J. de Leeuw et al., Prediction in Multilevel Models, 2005.

[22] Jeffrey S. Simonoff et al., RE-EM trees: a data mining approach for longitudinal and clustered data, 2011, Machine Learning.

[23] D. Barr et al., Random effects structure for confirmatory hypothesis testing: Keep it maximal, 2013, Journal of Memory and Language.

[24] Bruce Thompson, Stepwise Regression and Stepwise Discriminant Analysis Need Not Apply Here: A Guidelines Editorial, 1995.

[25] Gerhard Tutz et al., Variable selection for generalized linear mixed models by L1-penalized estimation, 2012, Statistics and Computing.

[26] T. Hothorn et al., Detecting treatment-subgroup interactions in clustered data with generalized linear mixed-effects model trees, 2017, Behavior Research Methods.

[27] Francis Tuerlinckx et al., Changing Dynamics: Time-Varying Autoregressive Models Using Generalized Additive Modeling, 2017, Psychological Methods.

[28] Robert Bozick et al., Education Longitudinal Study of 2002 (ELS:2002), 2007.

[29] Trevor Hastie et al., Additive Logistic Regression: A Statistical View of Boosting, 1998.

[30] Paul De Boeck et al., IRTrees: Tree-Based Item Response Models of the GLMM Family, 2012.

[31] Ulman Lindenberger et al., Structural equation model trees, 2013, Psychological Methods.

[32] Runze Li et al., Variable Selection in Linear Mixed Effects Models, 2012, Annals of Statistics.

[33] C. R. Henderson et al., Best linear unbiased estimation and prediction under a selection model, 1975, Biometrics.

[34] J. Elith et al., A working guide to boosted regression trees, 2008, The Journal of Animal Ecology.

[35] Benjamin Hofner et al., Model-based boosting in R: a hands-on tutorial using the R package mboost, 2012, Computational Statistics.

[36] Tianqi Chen et al., XGBoost: A Scalable Tree Boosting System, 2016, KDD.

[37] S. Rabe-Hesketh et al., Prediction in multilevel generalized linear models, 2009.

[38] K. Hornik et al., Model-Based Recursive Partitioning, 2008.