CoordinateCleaner: Standardized cleaning of occurrence records from biological collection databases

Species occurrence records from online databases are an indispensable resource in ecological, biogeographical and palaeontological research. However, issues with data quality, especially incorrect geo-referencing or dating, can diminish their usefulness. Manual cleaning is time-consuming, error-prone, difficult to reproduce and limited to known geographical areas and taxonomic groups, making it impractical for datasets with thousands or millions of records. Here, we present CoordinateCleaner, an R package to scan datasets of species occurrence records for geo-referencing and dating imprecisions and data entry errors in a standardized and reproducible way. CoordinateCleaner is tailored to problems common in biological and palaeontological databases and can handle datasets with millions of records. The software includes (a) functions to flag potentially problematic coordinate records based on geographical gazetteers, (b) a global database of 9,691 geo-referenced biodiversity institutions to identify records that are likely from horticulture or captivity, (c) novel algorithms to identify datasets with rasterized data, conversion errors and strong decimal rounding and (d) spatio-temporal tests for fossils. We describe the individual functions available in CoordinateCleaner and demonstrate them on more than 90 million occurrences of flowering plants from the Global Biodiversity Information Facility (GBIF) and 19,000 fossil occurrences from the Palaeobiology Database (PBDB). We find that in GBIF more than 3.4 million records (3.7%) are potentially problematic and that 179 of the tested contributing datasets (18.5%) might be biased by rasterized coordinates. In PBDB, 1,205 records (6.3%) are potentially problematic. All cleaning functions and the biodiversity institution database are open-source and available within the CoordinateCleaner R package.
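To illustrate the kind of record-level flagging described in (a) and (b), the following is a minimal Python sketch. It is not the CoordinateCleaner implementation and does not use its API: the `Record` class, the `flag_record` function, the flag names, the tolerance and radius defaults, and the example gazetteer point are all hypothetical and chosen only for illustration. It assumes a record is suspicious if its coordinates are out of range, lie at or near (0, 0), have equal absolute latitude and longitude, or fall within a fixed radius of a gazetteer point such as an institution.

```python
import math
from dataclasses import dataclass


@dataclass
class Record:
    """A single occurrence record (hypothetical structure)."""
    species: str
    lat: float
    lon: float


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def flag_record(rec, gazetteer=(), radius_km=10.0, zero_tol=0.5):
    """Return a list of flag names for one record.

    gazetteer is an iterable of (lat, lon) reference points, for example
    institution locations; all defaults are illustrative assumptions.
    """
    flags = []
    # Coordinates outside the valid latitude/longitude range.
    if not (-90 <= rec.lat <= 90 and -180 <= rec.lon <= 180):
        flags.append("invalid")
        return flags  # distance tests are meaningless for invalid coordinates
    # Coordinates at or very near (0, 0), a common data entry artefact.
    if abs(rec.lat) < zero_tol and abs(rec.lon) < zero_tol:
        flags.append("zeros")
    # Identical absolute latitude and longitude.
    if abs(rec.lat) == abs(rec.lon):
        flags.append("equal")
    # Within radius_km of any gazetteer point.
    if any(haversine_km(rec.lat, rec.lon, g_lat, g_lon) <= radius_km
           for g_lat, g_lon in gazetteer):
        flags.append("gazetteer")
    return flags


if __name__ == "__main__":
    # Hypothetical gazetteer with a single reference point.
    points = [(48.8566, 2.3522)]
    for rec in [Record("sp. A", 0.0, 0.0), Record("sp. B", 48.86, 2.35)]:
        print(rec.species, flag_record(rec, gazetteer=points))
```

Returning flags rather than deleting records mirrors the abstract's wording that records are "potentially problematic": the decision to exclude a flagged record is left to the user.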
