Boosted trees for ecological modeling and prediction.

Accurate prediction and explanation are fundamental objectives of statistical analysis, yet they seldom coincide. Boosted trees are a statistical learning method that attains both objectives for regression and classification analyses. They accommodate many types of response variables (numeric, categorical, and censored), loss functions (Gaussian, binomial, Poisson, and robust), and predictors (numeric and categorical). Interactions between predictors can also be quantified and visualized. The theory underpinning boosted trees is presented, together with interpretive techniques. A new form of boosted trees, namely "aggregated boosted trees" (ABT), is proposed and, in a simulation study, is shown to reduce prediction error relative to boosted trees. A regression data set is analyzed using ABT to illustrate the technique and to compare it with other methods, including boosted trees, bagged trees, random forests, and generalized additive models. A software package for ABT analysis using the R software environment is included in the Appendices, together with worked examples.
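The core idea of aggregated boosted trees can be sketched in a few lines: fit a gradient-boosted model on each of several bootstrap samples, then average the resulting predictions. The sketch below is illustrative only, not the paper's R package; it uses least-squares gradient boosting with regression stumps on a single predictor, and all function names (`fit_stump`, `boost`, `aggregate_boost`) and parameter defaults are this example's own assumptions.

```python
import random

def fit_stump(x, y):
    # Best single-split regression stump on 1-D x, minimizing squared error.
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(x)):
        thresh = (x[order[k - 1]] + x[order[k]]) / 2
        left = [y[i] for i in range(len(x)) if x[i] <= thresh]
        right = [y[i] for i in range(len(x)) if x[i] > thresh]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, thresh, ml, mr)
    _, t, ml, mr = best
    return lambda v: ml if v <= t else mr

def boost(x, y, n_trees=50, lr=0.1):
    # Least-squares gradient boosting: each stump is fitted to the
    # current residuals, and contributes a shrunken (lr-scaled) update.
    f0 = sum(y) / len(y)
    resid = [v - f0 for v in y]
    stumps = []
    for _ in range(n_trees):
        s = fit_stump(x, resid)
        stumps.append(s)
        resid = [r - lr * s(v) for r, v in zip(resid, x)]
    return lambda v: f0 + lr * sum(s(v) for s in stumps)

def aggregate_boost(x, y, n_models=10, **kw):
    # "Aggregated" boosted trees: bag several boosted models fitted on
    # bootstrap resamples of the data, and average their predictions.
    n = len(x)
    models = []
    for _ in range(n_models):
        idx = [random.randrange(n) for _ in range(n)]
        models.append(boost([x[i] for i in idx], [y[i] for i in idx], **kw))
    return lambda v: sum(m(v) for m in models) / len(models)
```

As in bagging generally, the averaging step reduces the variance component of prediction error, which is why the simulation study finds ABT improving on a single boosted model.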
