Paying it forward: Crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers

Abstract

Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone, because of issues such as homonyms (unrelated taxa with the same name) and synonyms (the same taxon under different names). Therefore, most projects will require some curation to ensure that taxon identifiers are correctly linked. Unfortunately, formal guidance on such curation is uncommon, and these steps are often ad hoc and poorly documented, which hinders transparency and reproducibility. At the same time, the task requires specialist knowledge and cannot be easily automated without careful validation. Here, we present a case study on linking identifiers between the GBIF and NCBI taxonomies for a species checklist. This represents a common scenario: finding published sequence data (from NCBI) for species chosen by occurrence or geographical distribution (from GBIF). Wikidata, a publicly editable knowledge base of structured data, can serve as an additional information source for identifier linking. We suggest a software toolkit for taxon name-matching and data-cleaning, describe common issues encountered during curation and propose concrete steps to address them. For example, about 2.8% of the taxa in our dataset had incorrect identifiers linked on Wikidata because of name-matching errors caused by homonyms. By correcting such errors during data-cleaning, either directly (by editing Wikidata) or indirectly (by reporting errors in GBIF or NCBI), we crowdsource the curation and contribute to community resources, thereby improving the quality of downstream analyses.
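To illustrate why name-matching alone is error-prone, the following is a minimal sketch, not the toolkit described in the paper. It links two toy record lists by exact name and sends any name with zero or several candidates to manual review, so that homonyms are never linked silently. The record layout, the `kingdom` field, the function name `link_by_name` and all identifiers (`G1`, `N1`, ...) are hypothetical; real GBIF and NCBI classifications use different rank labels and would need to be harmonised first.

```python
from collections import defaultdict


def link_by_name(gbif_records, ncbi_records):
    """Link records by exact name; flag ambiguous or missing matches.

    Each record is a dict with hypothetical keys "id", "name" and "kingdom".
    Returns (linked, needs_review).
    """
    # Index the NCBI-style records by name; one name may map to several taxa.
    ncbi_by_name = defaultdict(list)
    for rec in ncbi_records:
        ncbi_by_name[rec["name"]].append(rec)

    linked, needs_review = [], []
    for rec in gbif_records:
        candidates = ncbi_by_name.get(rec["name"], [])
        # Accept a link only when the name is unique AND the kingdom agrees.
        if len(candidates) == 1 and candidates[0]["kingdom"] == rec["kingdom"]:
            linked.append((rec["id"], candidates[0]["id"]))
        else:
            # Homonyms, kingdom conflicts and unmatched names go to a curator.
            needs_review.append((rec["id"], [c["id"] for c in candidates]))
    return linked, needs_review


# Toy data: "Morus" is both a plant genus (mulberries) and a bird genus (gannets).
gbif = [
    {"id": "G1", "name": "Morus", "kingdom": "Plantae"},
    {"id": "G2", "name": "Carex", "kingdom": "Plantae"},
    {"id": "G3", "name": "Unmatched", "kingdom": "Plantae"},
]
ncbi = [
    {"id": "N1", "name": "Morus", "kingdom": "Plantae"},
    {"id": "N2", "name": "Morus", "kingdom": "Animalia"},
    {"id": "N3", "name": "Carex", "kingdom": "Plantae"},
]

linked, needs_review = link_by_name(gbif, ncbi)
print(linked)
print(needs_review)
```

The sketch deliberately refuses to resolve the homonym automatically, even though the kingdom would disambiguate it here. This reflects the abstract's point that curation needs specialist judgement. Synonyms are not handled at all and would appear as unmatched names.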
