Sensitivity Analysis to Configuration Option Settings in a Selection of Species Distribution Modelling Algorithms

In pursuit of more robust provenance in the field of species distribution modelling, we undertook an extensive literature search to find the typical default values, and the range of values, for the configuration settings of some of the most commonly used statistical algorithms for constructing species distribution models (SDMs), as implemented in R packages (such as dismo and biomod2) or in other species distribution modelling programs such as Maxent. We found that configuration option settings are rarely documented in the SDM literature, and that where they are documented, the justification given is minimal. Such settings were often the R default values, or were the result of trial and error. This is concerning for several reasons. It detracts from the robustness of the provenance of such SDM studies. A lack of documentation of configuration option settings in a paper also prevents replication of the experiment, which contravenes one of the main tenets of the scientific method. Inappropriate or uninformed configuration option settings are particularly concerning if they represent a poorly understood ecological variable or process and the algorithm is sensitive to them, since this could result in erroneous or unrealistic SDMs. We tested the sensitivity of two commonly used SDM algorithms, Random Forests and Boosted Regression Trees, to variation in their configuration option settings. A process of expert elicitation was used to derive a range of appropriate values with which to test the sensitivity of each algorithm. We used species occurrence records for the koala (Phascolarctos cinereus) in our sensitivity tests, since the species has a well-known distribution. Results were assessed by comparing the geospatial distribution predicted by each sensitivity-test SDM (i.e. altered settings) with that of the control SDM (i.e. default settings), using a geographical information system (QGIS). In addition, two performance measures were used to compare the altered-setting SDMs with the control. The aim of our study was to draw conclusions about how reliable reported SDM results may be, given the sensitivity of the algorithms to certain settings, the often arbitrary nature of those settings, and the lack of awareness of, or attention to, this issue in most of the published SDM literature. Our results indicate that both algorithms tested were sensitive to alternate values for some of their settings. This study has therefore shown that the choice of configuration option settings in Random Forests and Boosted Regression Trees affects the results, and that assigning suitable values to these settings is a relevant consideration. The settings used should always be published along with the model.
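The comparison of altered-setting SDMs with a control can be illustrated with a minimal sketch. It is not the study's implementation. The abstract does not name the two performance measures, so the rank-based AUC used here is an assumption, as is summarising map change by the mean absolute difference in predicted suitability per grid cell. All function names, the setting label, and the toy values are hypothetical.

```python
from itertools import product


def auc(scores, labels):
    """Rank-based AUC: chance that a presence cell outscores an absence cell (ties count 0.5)."""
    presences = [s for s, y in zip(scores, labels) if y == 1]
    absences = [s for s, y in zip(scores, labels) if y == 0]
    if not presences or not absences:
        raise ValueError("need at least one presence and one absence")
    wins = sum(
        1.0 if p > a else 0.5 if p == a else 0.0
        for p, a in product(presences, absences)
    )
    return wins / (len(presences) * len(absences))


def mean_abs_difference(control, altered):
    """Mean absolute difference in predicted suitability across matching grid cells."""
    if len(control) != len(altered):
        raise ValueError("prediction grids must have the same number of cells")
    return sum(abs(c - a) for c, a in zip(control, altered)) / len(control)


def compare_to_control(control, altered_runs, labels):
    """Summarise how each altered-setting run departs from the control run."""
    control_auc = auc(control, labels)
    return {
        name: {
            "auc_change": auc(pred, labels) - control_auc,
            "map_change": mean_abs_difference(control, pred),
        }
        for name, pred in altered_runs.items()
    }


# Hypothetical toy data: 1 = presence record, 0 = absence/background cell.
labels = [1, 1, 1, 0, 0, 0]
control_prediction = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]        # default settings
altered_runs = {
    "fewer_trees": [0.9, 0.5, 0.6, 0.7, 0.3, 0.1],         # one altered setting
}

summary = compare_to_control(control_prediction, altered_runs, labels)
for name, changes in summary.items():
    print(name, changes)
```

A large `auc_change` or `map_change` for a given setting would suggest that the algorithm is sensitive to that setting. In practice the prediction lists would be replaced by raster cell values exported from each fitted model.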
