Species distribution models can be highly sensitive to algorithm configuration

To strengthen provenance in species distribution modelling, we undertook an extensive literature search to identify the typical default values, and the range of values, used for the configuration settings of many of the most commonly used statistical algorithms for constructing species distribution models (SDMs), as implemented in R packages (such as dismo and biomod2) or in other species distribution modelling programs such as MaxEnt. We found that configuration option settings are rarely documented in the SDM literature and that, where they are reported, the justification given is minimal. Such settings were often the R default values, or were the result of trial and error. This is concerning for three reasons: (i) it weakens the provenance of such SDM studies; (ii) without documented configuration option settings an experiment cannot be replicated, which contravenes one of the main tenets of the scientific method; and (iii) inappropriate or uninformed settings are particularly problematic when they represent a poorly understood ecological variable or process, because if the algorithm is sensitive to such settings the resulting SDMs could be erroneous or unrealistic. This study therefore tests the sensitivity of eight widely used SDM algorithms to variation in their configuration option settings: MaxEnt, Artificial Neural Network (ANN), Generalized Linear Model (GLM), Generalized Additive Model (GAM), Multivariate Adaptive Regression Splines (MARS), Flexible Discriminant Analysis (FDA), Surface Range Envelope (SRE) and Classification Tree Analysis (CTA). A process of expert elicitation was used to derive a range of appropriate values with which to test the sensitivity of each algorithm.
We used species occurrence records for two species, the Koala (Phascolarctos cinereus) and the Thorny Devil (Moloch horridus), to investigate how algorithm sensitivity depends on the species being modelled. Results were assessed by comparing the modelled distribution of the control SDM (default settings) with the modelled distribution from each sensitivity test SDM (i.e. non-default configuration settings). This was done using the visual and statistical measures of predictive performance available in the Biodiversity and Climate Change Virtual Laboratory (BCCVL), including the area under the receiver operating characteristic curve (AUC). Our aim was to determine how the sensitivity of SDM algorithms to their configuration option settings may detract from the reliability of SDM results, given that default settings are often used without justification or scrutiny and that the issue receives only infrequent and largely perfunctory attention in the published SDM literature. Our results indicate that all of the algorithms tested were sensitive to alternative (non-default) values for some of their configuration settings, and that this sensitivity is often species-dependent. We therefore conclude that the choice of configuration settings in these widely used SDM algorithms can have a large impact on the resulting projected distribution. This has important ramifications for decision-making and policy outcomes wherever SDMs are used to inform species and biodiversity management plans and policy settings. This study demonstrates that assigning suitable values for these settings is an important consideration, and that the values used should always be published along with the model. Documenting all configuration settings is necessary to increase the scientific robustness, transparency and reproducibility of species distribution modelling studies.
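The control-versus-test comparison described above can be illustrated with a minimal sketch. It is not the BCCVL workflow or any package's implementation: it uses a toy rectangular-envelope model in the spirit of SRE, synthetic occurrence data, and hypothetical names (`fit_envelope`, `predict_envelope`, `sensitivity`), with an assumed control quantile of 0.025. AUC is computed on the training sites purely for illustration.

```python
import random

def auc(scores, labels):
    """Rank-based AUC: probability that a presence outscores a background site (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_envelope(presences, quantile):
    """Per-variable (low, high) bounds after trimming `quantile` of records from each tail."""
    bounds = []
    for values in zip(*presences):
        ordered = sorted(values)
        k = int(quantile * len(ordered))
        bounds.append((ordered[k], ordered[-1 - k]))
    return bounds

def predict_envelope(bounds, site):
    """Return 1 if every variable of `site` lies inside the envelope, else 0."""
    return int(all(lo <= v <= hi for (lo, hi), v in zip(bounds, site)))

def sensitivity(presences, background, quantiles, control=0.025):
    """Compare each test setting with the control (assumed default) setting."""
    sites = presences + background
    labels = [1] * len(presences) + [0] * len(background)
    control_bounds = fit_envelope(presences, control)
    base = [predict_envelope(control_bounds, s) for s in sites]
    report = {}
    for q in quantiles:
        bounds = fit_envelope(presences, q)
        pred = [predict_envelope(bounds, s) for s in sites]
        # Share of sites whose predicted suitability differs from the control model.
        changed = sum(a != b for a, b in zip(base, pred)) / len(sites)
        report[q] = {"auc": auc(pred, labels), "changed": changed}
    return report

# Synthetic data (an assumption): two climate-like variables per site.
random.seed(42)
presences = [(random.gauss(20, 2), random.gauss(800, 100)) for _ in range(200)]
background = [(random.uniform(5, 35), random.uniform(200, 1400)) for _ in range(500)]

report = sensitivity(presences, background, [0.0, 0.025, 0.05, 0.1])
for q, row in report.items():
    print(f"quantile={q:<6} AUC={row['auc']:.3f} changed={row['changed']:.3f}")
```

Each row reports one setting alongside its effect, which is also the minimum a paper would need to publish for the run to be repeatable: the setting name, the value used, and the value it was compared against.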
