Do pseudo-absence selection strategies influence species distribution models and their predictions? An information-theoretic approach based on simulated data

Background

Multiple logistic regression is precluded from many practical applications in ecology that aim to predict the geographic distributions of species because it requires absence data, which are rarely available or are unreliable. To use multiple logistic regression, many studies have simulated "pseudo-absences" through a number of strategies, but it is unknown how the choice of strategy influences models and their geographic predictions. In this paper we evaluate the effect of several prevailing pseudo-absence strategies on the predictions of the geographic distribution of a virtual species whose "true" distribution and relationship to three environmental predictors were predefined. We evaluated the effect of using a) real absences, b) pseudo-absences selected randomly from the background, and c) two-step approaches, in which pseudo-absences are selected from low-suitability areas predicted by either Ecological Niche Factor Analysis (ENFA) or BIOCLIM. We compared how the choice of pseudo-absence strategy affected model fit, predictive power, and information-theoretic model selection results.

Results

Models built with true absences had the best predictive power and the best discriminatory power, and the "true" model (the one that contained the correct predictors) was supported by the data according to AIC, as expected. Models based on random pseudo-absences had among the lowest fit, but yielded the second highest AUC value (0.97), and the "true" model was also supported by the data. Models based on two-step approaches had intermediate fit and the lowest predictive power, and the "true" model was not supported by the data.

Conclusion

If ecologists wish to build parsimonious GLMs that will allow them to make robust predictions, a reasonable approach is to use a large number of randomly selected pseudo-absences and to perform model selection based on an information-theoretic approach. However, the resulting models can be expected to have limited fit.
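The random pseudo-absence strategy combined with AIC-based model selection can be illustrated with a minimal sketch. This is not the authors' code or data: the landscape size, the coefficients of the virtual species, the sample sizes, the predictor names (`x1`, `x2`, `x3`, `noise`) and the helper names (`fit_logistic`, `run_demo`) are all assumptions chosen for illustration, and the logistic regression is fitted with a plain NumPy Newton-Raphson loop rather than a GLM package.

```python
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Fit a logistic regression by Newton-Raphson; return (beta, log-likelihood)."""
    X = np.column_stack([np.ones(len(y)), X])        # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30, 30)             # clip to avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)
        hessian = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])
        step = np.linalg.solve(hessian, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    eta = np.clip(X @ beta, -30, 30)
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    return beta, loglik

def run_demo(seed=0, n_cells=20000, n_presences=500, n_pseudo=5000):
    """Simulate a virtual species and rank candidate models by AIC."""
    rng = np.random.default_rng(seed)
    names = ["x1", "x2", "x3", "noise"]
    env = rng.normal(size=(n_cells, 4))              # hypothetical landscape
    # Assumed "true" model: occupancy depends on x1, x2, x3 only.
    eta_true = -2.0 + 1.5 * env[:, 0] - 1.0 * env[:, 1] + 1.0 * env[:, 2]
    occupied = rng.random(n_cells) < 1.0 / (1.0 + np.exp(-eta_true))

    # Presence records: a sample of the occupied cells.
    pres = rng.choice(np.flatnonzero(occupied), size=n_presences, replace=False)
    # Pseudo-absences: a large random sample of the whole background.
    pseudo = rng.choice(n_cells, size=n_pseudo, replace=False)

    rows = np.concatenate([pres, pseudo])
    y = np.concatenate([np.ones(n_presences), np.zeros(n_pseudo)])

    candidates = [("x1",), ("x1", "x2"), ("x1", "x2", "x3"),
                  ("x1", "x2", "x3", "noise")]
    aics = {}
    for cand in candidates:
        cols = [names.index(c) for c in cand]
        _, loglik = fit_logistic(env[rows][:, cols], y)
        k = len(cols) + 1                            # slopes plus intercept
        aics["+".join(cand)] = 2 * k - 2 * loglik    # AIC = 2k - 2 ln L
    return aics

if __name__ == "__main__":
    for model, value in sorted(run_demo().items(), key=lambda kv: kv[1]):
        print(f"{model:20s} AIC = {value:.1f}")
```

In this sketch the candidate with the lowest AIC is the best supported; models that omit one of the three assumed predictors should score clearly worse than the model that contains all of them. The model that adds `noise` may land within a couple of AIC units of the three-predictor model, which is the kind of ambiguity that information-theoretic model selection is meant to expose.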
