A working guide to boosted regression trees.

1. Ecologists use statistical models for both explanation and prediction, and need techniques that are flexible enough to express typical features of their data, such as nonlinearities and interactions.

2. This study provides a working guide to boosted regression trees (BRT), an ensemble method for fitting statistical models that differs fundamentally from conventional techniques that aim to fit a single parsimonious model. Boosted regression trees combine the strengths of two algorithms: regression trees (models that relate a response to its predictors by recursive binary splits) and boosting (an adaptive method for combining many simple models to give improved predictive performance). The final BRT model can be understood as an additive regression model in which individual terms are simple trees, fitted in a forward, stagewise fashion.

3. Boosted regression trees incorporate important advantages of tree-based methods, handling different types of predictor variables and accommodating missing data. They have no need for prior data transformation or elimination of outliers, can fit complex nonlinear relationships, and automatically handle interaction effects between predictors. Fitting multiple trees in BRT overcomes the biggest drawback of single tree models: their relatively poor predictive performance. Although BRT models are complex, they can be summarized in ways that give powerful ecological insight, and their predictive performance is superior to most traditional modelling methods.

4. The unique features of BRT raise a number of practical issues in model fitting. We demonstrate the practicalities and advantages of using BRT through a distributional analysis of the short-finned eel (Anguilla australis Richardson), a native freshwater fish of New Zealand. We use a data set of over 13 000 sites to illustrate effects of several settings, and then fit and interpret a model using a subset of the data. We provide code and a tutorial to enable the wider use of BRT by ecologists.
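
The description of BRT as an additive model of simple trees, fitted in a forward, stagewise fashion, can be illustrated with a minimal sketch. This is not the code or tutorial that accompanies the study. It assumes a single continuous predictor, squared-error loss, two-leaf trees (stumps) and simulated data, and the names `fit_stump`, `boost`, `predict` and `mse` are hypothetical helpers introduced only for this illustration.

```python
import math
import random

def fit_stump(x, r):
    """Best single binary split of x for predicting r (squared error).
    Assumes x holds at least two distinct values."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    n, total = len(x), sum(r)
    best_gain, best, left_sum = float("-inf"), None, 0.0
    for k in range(1, n):
        left_sum += r[order[k - 1]]
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # cannot split between tied predictor values
        right_sum = total - left_sum
        # Minimising within-leaf squared error is equivalent to maximising this.
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if gain > best_gain:
            best_gain = gain
            best = ((lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best  # (threshold, left-leaf mean, right-leaf mean)

def boost(x, y, n_trees=200, learning_rate=0.1):
    """Forward stagewise fitting: each stump is fitted to the current
    residuals, shrunk by the learning rate, and added to the model.
    Earlier stumps are never refitted."""
    intercept = sum(y) / len(y)
    pred = [intercept] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, left, right = fit_stump(x, residuals)
        stumps.append((t, left, right))
        pred = [p + learning_rate * (left if xi <= t else right)
                for xi, p in zip(x, pred)]
    return intercept, learning_rate, stumps

def predict(model, xi):
    """Additive prediction: intercept plus the shrunken sum of all stumps."""
    intercept, learning_rate, stumps = model
    return intercept + learning_rate * sum(
        left if xi <= t else right for t, left, right in stumps)

def mse(model, x, y):
    return sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Simulated nonlinear response (an assumption; not the eel data set).
rng = random.Random(1)
x_train = [rng.uniform(0, 10) for _ in range(150)]
y_train = [math.sin(v) + rng.gauss(0, 0.2) for v in x_train]
x_test = [rng.uniform(0, 10) for _ in range(100)]
y_test = [math.sin(v) + rng.gauss(0, 0.2) for v in x_test]

single = boost(x_train, y_train, n_trees=1, learning_rate=1.0)  # one tree
model = boost(x_train, y_train, n_trees=200, learning_rate=0.1)
mse_single = mse(single, x_test, y_test)
mse_boosted = mse(model, x_test, y_test)
print(f"held-out MSE, single stump: {mse_single:.3f}")
print(f"held-out MSE, boosted:      {mse_boosted:.3f}")
```

On this simulated response, the comparison of held-out error between one stump and the boosted sequence shows the contrast between single-tree and multiple-tree models described in point 3. The number of trees and the learning rate used here are arbitrary illustrative values, not recommended settings.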
