Identifying appropriate spatial scales of predictors in species distribution models with the random forest algorithm

Including predictors in species distribution models at inappropriate spatial scales can decrease the variance explained, add residual spatial autocorrelation (RSA) and lead to wrong conclusions. Some studies have measured predictors within different buffer sizes (scales) around sample locations, regressed the response against each predictor at each scale and selected the scale with the best model fit as the appropriate scale for that predictor. However, a predictor can influence a species at several scales, or can show good model fit at several scales because of a bias caused by RSA, so all scales with good model fit need to be evaluated. With potentially several scales per predictor and multiple predictors to evaluate, the number of candidate predictors can be large relative to the number of data points, which can impede variable selection with traditional statistical techniques such as logistic regression. We trialled a variable selection process using the random forest algorithm, which allows the simultaneous evaluation of several scales of multiple predictors. Using simulated responses, we compared the performance of models resulting from this approach with models using the known predictors at arbitrary and at the known spatial scales. We also applied the proposed approach to a real data set of curlew (Numenius arquata). AIC, AUC and Nagelkerke's pseudo-R² of the models resulting from the proposed variable selection were often very similar to those of the models with the known predictors at known spatial scales. Only two of nine models required the addition of spatial eigenvectors to account for RSA, whereas arbitrary-scale models always required them. Of the known predictors, 75% (50–100%) were selected at scales similar to the known scale (within 3 km). In the curlew model, predictors at large, medium and small spatial scales were selected, suggesting that multiple scales need to be evaluated for appropriate landscape‐scale models.
The proposed approach selected several of the correct predictors at appropriate spatial scales out of 544 possible predictors. Thus, it facilitates the evaluation of multiple spatial scales of multiple predictors against each other in landscape‐scale models.
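
The buffer-based construction of multi-scale predictors described above can be illustrated with a minimal sketch. It is not the random forest procedure of this study: it only shows how one covariate might be averaged within nested buffers around sample sites and how a per-scale fit (here a univariate AUC) could be compared, which is the single-predictor screening that the abstract contrasts with the joint evaluation. All names (`buffer_mean`, `auc`, `simulate`), the simulated grid, the radii (in grid-cell units, not km) and the logistic slope of 40 are assumptions made for illustration.

```python
import math
import random


def buffer_mean(site, cells, radius):
    """Mean covariate value of all cells whose centre lies within `radius` of `site`."""
    sx, sy = site
    inside = [v for x, y, v in cells if math.hypot(x - sx, y - sy) <= radius]
    return sum(inside) / len(inside)


def auc(scores, labels):
    """Area under the ROC curve as the share of (presence, absence) pairs
    in which the presence has the higher score; ties count as one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simulate(seed=1, size=30, n_sites=120, true_radius=3, scales=(1, 3, 6, 10)):
    """Simulate a gridded covariate, sample sites on cell centres, and a
    presence/absence response driven by the covariate at `true_radius`.
    `true_radius` must be one of `scales`."""
    rng = random.Random(seed)
    cells = [(x, y, rng.random()) for x in range(size) for y in range(size)]
    sites = [(rng.randrange(size), rng.randrange(size)) for _ in range(n_sites)]
    # One candidate predictor per scale: the buffer mean around every site.
    predictors = {r: [buffer_mean(s, cells, r) for s in sites] for r in scales}
    driver = predictors[true_radius]
    centre = sum(driver) / len(driver)
    # Logistic link with an arbitrary slope; presence is more likely where
    # the covariate at the true scale is above its mean.
    labels = [
        1 if rng.random() < 1 / (1 + math.exp(-40 * (v - centre))) else 0
        for v in driver
    ]
    return predictors, labels


predictors, labels = simulate()
fit_by_scale = {r: auc(values, labels) for r, values in predictors.items()}
for radius, fit in sorted(fit_by_scale.items()):
    print(f"radius {radius:>2}: AUC = {fit:.3f}")
```

Because buffers are nested, predictors at neighbouring scales are correlated, so several radii can show similar fit. This is one reason why screening each scale in isolation can be ambiguous and why the scales are evaluated against each other in the approach proposed here.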
