The effect of sample size on the accuracy of species distribution models: considering both presences and pseudo‐absences or background sites

Most high-performing species distribution modelling techniques require presences together with absences, pseudo-absences or background points. In this paper, we explore the effect of sample size, with the aim of developing improved modelling strategies. We generated 1800 virtual species with three levels of prevalence using ten modelling techniques, while varying the number of training presences (NTP) and the number of random points (NRP, representing pseudo-absences or background sites). For five of the ten modelling techniques we built two versions of models: one with an equal total weight (ETW) setting, where the total weight for pseudo-absences equals the total weight for presences, and another with an unequal total weight (UTW) setting, where the two total weights are not required to be equal. We compared two strategies for NRP: a small multiplier strategy (i.e. setting NRP to a few times NTP) and a large number strategy (i.e. using numerous random points). We produced ensemble models (by averaging the predictions from 30 models built with the same set of training presences and different sets of random points of equal size) for three NTP magnitudes and two NRP strategies. We found that model accuracy changed as NRP increased, following one of four distinct patterns: increasing, decreasing, arch-shaped and horizontal. In most cases ETW improved model performance. Ensemble models had higher accuracy than the corresponding single models, and this improvement was pronounced when NTP was low. We conclude that a large NRP is not always an appropriate strategy. The best choice for NRP depends on the modelling technique used, species prevalence and NTP. We recommend building ensemble models instead of single models, using the small multiplier strategy for NRP with ETW, especially when only a small number of species presence records are available.
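
The weighting and ensemble ideas described above can be illustrated with a minimal sketch. It is not the authors' implementation. All function names are hypothetical, the convention of giving each presence a weight of 1.0 is an assumption, and the actual model-fitting step is left out because it depends on the modelling technique.

```python
import random
from statistics import fmean


def small_multiplier_nrp(n_presence, multiplier=3):
    """Small multiplier strategy: NRP is a few times NTP (multiplier is an assumed value)."""
    return n_presence * multiplier


def case_weights(n_presence, n_random, equal_total_weight=True):
    """Return (weight per presence, weight per random point).

    ETW: random-point weights are scaled so both classes have the same total weight.
    UTW: every record keeps weight 1.0, so the totals differ whenever NTP != NRP.
    Assumption: each presence has weight 1.0 in both settings.
    """
    if n_presence <= 0 or n_random <= 0:
        raise ValueError("both sample sizes must be positive")
    if equal_total_weight:
        return 1.0, n_presence / n_random
    return 1.0, 1.0


def draw_random_points(candidate_sites, n_random, rng):
    """Draw one set of random points (pseudo-absences or background sites) without replacement."""
    return rng.sample(candidate_sites, n_random)


def ensemble_mean(predictions):
    """Average predictions site by site across models (one inner list per model)."""
    return [fmean(site_values) for site_values in zip(*predictions)]


if __name__ == "__main__":
    rng = random.Random(0)
    ntp = 20
    nrp = small_multiplier_nrp(ntp)            # 60 random points for 20 presences
    w_presence, w_random = case_weights(ntp, nrp)
    print(w_presence * ntp, w_random * nrp)    # the two totals match under ETW

    # 30 different sets of random points, all the same size, for one set of presences.
    sites = list(range(1000))
    random_sets = [draw_random_points(sites, nrp, rng) for _ in range(30)]
    print(len(random_sets), len(random_sets[0]))
```

In practice each of the 30 sets of random points would be combined with the same training presences to fit one model, and `ensemble_mean` would then average the 30 prediction vectors over a common set of evaluation sites.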
