occAssess: An R package for assessing potential biases in species occurrence data

Species occurrence records from a variety of sources are increasingly aggregated into heterogeneous databases and made available to ecologists for immediate analytical use. However, these data are typically biased, i.e. they are not a probability sample of the target population of interest, so the information they provide may not accurately reflect reality. It is therefore crucial that species occurrence data are properly scrutinised before they are used for research. In this article, we introduce occAssess, an R package that enables straightforward screening of species occurrence data for potential biases. The package contains a number of discrete functions, each of which returns a measure of the potential for bias in one or more of the taxonomic, temporal, spatial and environmental dimensions. Users can opt to provide a set of time periods into which the data will be split; in this case, separate outputs are provided for each period, which makes the package particularly useful for assessing the suitability of a dataset for estimating temporal trends in species’ distributions. The outputs are provided visually (as ggplot2 objects) and do not include a formal recommendation as to whether the data are of sufficient quality for any given inferential use. Instead, they should be used as ancillary information and viewed in the context of the question being asked and the methods being used to answer it. We demonstrate the utility of occAssess by applying it to data on two key pollinator taxa in South America: leaf-nosed bats (Phyllostomidae) and hoverflies (Syrphidae). In this worked example, we briefly assess the degree to which various aspects of data coverage appear to have changed over time. We then discuss additional applications of the package, highlight its limitations, and point to future development opportunities.
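The period-splitting idea described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the occAssess API: the function name `summarise_by_period`, the record layout, and the toy records are all hypothetical. It only shows the general pattern of splitting occurrence records into user-supplied periods and returning one summary per period (here, the number of records and the number of distinct species).

```python
def summarise_by_period(records, periods):
    """Summarise occurrence records separately for each time period.

    records: iterable of (species, year) tuples (hypothetical layout).
    periods: list of (start_year, end_year) tuples, both ends inclusive.
    Returns one dict per period, in the order the periods were given.
    """
    summary = []
    for start, end in periods:
        # Keep only the species names of records that fall inside this period.
        in_period = [species for species, year in records if start <= year <= end]
        summary.append({
            "period": (start, end),
            "n_records": len(in_period),       # how many records were made
            "n_species": len(set(in_period)),  # how many distinct species were recorded
        })
    return summary


# Toy records, invented purely for illustration.
records = [
    ("species_a", 1955),
    ("species_a", 1998),
    ("species_b", 1999),
    ("species_b", 2004),
    ("species_c", 2010),
    ("species_c", 2012),
    ("species_a", 2015),
]
periods = [(1950, 1989), (1990, 2009), (2010, 2019)]

summary = summarise_by_period(records, periods)
for row in summary:
    print(row)
```

A large difference between periods in summaries like these would not by itself show that a dataset is unusable. In keeping with the abstract, such outputs are ancillary information to be read in the context of the question and the analysis method.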
