Paying it forward: Crowdsourcing of taxonomic harmonization and linking of biodiversity identifiers

Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone, because of issues such as homonyms (unrelated taxa with the same name) and synonyms (same taxon under different names). Therefore, most projects will require some degree of curation to ensure that taxon identifiers are correctly linked. Unfortunately, formal guidance on such curation is uncommon, and these steps are often ad hoc and poorly documented, which hinders transparency and reproducibility. At the same time, the task requires specialist knowledge and cannot be easily automated without careful validation. Here we present a case study on linking identifiers between the GBIF and NCBI taxonomies for a species checklist dataset. This represents a common usage scenario: finding publicly available sequencing data (available from NCBI) for species chosen by their occurrence or geographical distribution (from GBIF). Wikidata, a publicly editable knowledge base of structured data, can serve as an additional information source for identifier linking. We suggest a software toolkit for taxon name matching and data cleaning, describe common issues encountered during curation, and propose concrete steps to address them. For example, about 2.8% of the taxa in our dataset had wrong identifiers linked on Wikidata because of errors in name matching caused by homonyms. By correcting such errors during data cleaning, either directly (through editing Wikidata) or indirectly (by reporting errors in GBIF or NCBI), we crowdsource the curation and contribute to the improvement of community resources, thereby improving the quality of downstream analyses.
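The homonym problem described above can be illustrated with a minimal sketch. It is not the toolkit presented in this study. It assumes toy records with placeholder identifiers (not real GBIF or NCBI IDs), a simplified and hypothetical kingdom mapping (`KINGDOM_EQUIV`), and two hypothetical helper functions, `link_by_name` and `classify`. The genus name Morus, used for both mulberries (plants) and gannets (birds), serves as the homonym.

```python
from collections import defaultdict

# Toy records for illustration only; the IDs are placeholders,
# not real GBIF or NCBI identifiers.
GBIF = [
    {"id": "gbif-1", "name": "Morus", "kingdom": "Plantae"},
    {"id": "gbif-2", "name": "Morus", "kingdom": "Animalia"},
    {"id": "gbif-3", "name": "Carex", "kingdom": "Plantae"},
]
NCBI = [
    {"id": "ncbi-1", "name": "Morus", "kingdom": "Viridiplantae"},
    {"id": "ncbi-2", "name": "Morus", "kingdom": "Metazoa"},
    {"id": "ncbi-3", "name": "Carex", "kingdom": "Viridiplantae"},
]

# Simplified, assumed correspondence between kingdom labels in the two
# taxonomies; a real mapping would need curation of its own.
KINGDOM_EQUIV = {"Plantae": "Viridiplantae", "Animalia": "Metazoa"}


def link_by_name(gbif, ncbi):
    """Naive linking: pair each GBIF record with every NCBI record sharing its name."""
    by_name = defaultdict(list)
    for rec in ncbi:
        by_name[rec["name"]].append(rec)
    return [(g, n) for g in gbif for n in by_name.get(g["name"], [])]


def classify(pairs):
    """Split name-matched pairs into those whose kingdoms agree and those that conflict."""
    accepted, rejected = [], []
    for g, n in pairs:
        link = (g["id"], n["id"])
        if KINGDOM_EQUIV.get(g["kingdom"]) == n["kingdom"]:
            accepted.append(link)
        else:
            rejected.append(link)
    return accepted, rejected


if __name__ == "__main__":
    pairs = link_by_name(GBIF, NCBI)
    accepted, rejected = classify(pairs)
    print("name matches:", len(pairs))
    print("kept after kingdom check:", accepted)
    print("flagged as likely homonym mismatches:", rejected)
```

In this toy case, matching on the name alone links the plant genus to the bird genus and vice versa, and comparing a higher-rank classification flags those pairs. A check like this only narrows down candidates. Homonyms within the same kingdom, and synonyms, would still need manual curation.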
