Quality issues in georeferencing: From physical collections to digital data repositories for ecological research

Natural history collections constitute an enormous wealth of information on life on Earth. It is estimated that over 2 billion specimens are preserved at institutions worldwide, of which less than 10% are accessible via biodiversity data aggregators such as GBIF. Moreover, they are a very important resource for eco-evolutionary research, which greatly depends on knowing the precise location where the specimens were collected in order to characterize the environment in which they lived. Yet, only about 55% of the accessible records are georeferenced and only 31% have coordinate uncertainty information, which is critical for conducting rigorous studies. Awareness of this knowledge gap, which hinders the enormous potential of such data in research, led to the organization of a workshop that brought together key players in the georeferencing of natural history collections. The discussion and outcomes of this workshop are presented here.

Natural history collections are a superb record of life on Earth (Holmes et al., 2016). In contrast to simple observations of occurrence, physical samples held in museums, herbaria and other institutions support reproducible and repeatable research and allow new data to be extracted from the collected individual or sample (e.g., molecular or genetic markers) on a much richer scale than other kinds of representation, such as photographs (but see Lunghi et al., 2020). Furthermore, as with other observations of occurrence, their recorded date and place of collection make it possible to link them to the abiotic and biotic conditions in which they lived. From this, one can infer spatio-temporal ecological and evolutionary patterns in their occurrence. The global set of preserved specimens collected over centuries represents a large potential resource for future research (National Academies of Sciences, Engineering, and Medicine, 2020). Some estimates of the total number of preserved specimens held in institutions worldwide (e.g. 
natural history museums and herbaria) are in the order of 2 billion (Ariño, 2010). In the last decade, a myriad of digitization initiatives, combined with the growth of computer and information technologies, have yielded a growing stream of data flowing from natural history collection institutions into aggregators such as the Global Biodiversity Information Facility (GBIF) and the Ocean Biodiversity Information System (OBIS). GBIF is an international, publicly funded research infrastructure that plays a key role in channelling these data to end users, mainly researchers. As of November 2020, around 11.6% of GBIF records come from natural history collections. The task of digitization is gargantuan and, despite all this work, accessible digital specimen records still represent, at most, only about 10% of the collection holdings worldwide. Most digitization has been funded by, and has taken place in, data-rich regions such as Europe, the Americas and Australia.

Moreover, it is crucial for research, notably in species distribution and ecological niche modelling for biogeographical, evolutionary and conservation studies, that these records have been georeferenced. Georeferencing is the process by which geographical coordinates are assigned to physical specimens that only have a textual description of their geographic origin. Special consideration needs to be given to sensitive data in order to prevent potential threats to biodiversity (Chapman, 2020; Lunghi et al., 2019; Tulloch et al., 2018). The rigorous resolution of the coordinates where the specimen was collected, together with their uncertainty, is paramount to correctly characterize the environmental conditions and the habitat where an organism lived. It determines the spatial resolution at which research can be safely conducted. Yet, only about 55% of published records purporting to be specimens in GBIF have coordinates and only 31% of these have uncertainty information. 
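As a minimal illustration of why coordinate uncertainty determines the usable spatial resolution, the following Python sketch screens a handful of made-up records. The field names follow the Darwin Core terms `decimalLatitude`, `decimalLongitude` and `coordinateUncertaintyInMeters`. The example records, the helper functions and the screening rule (keep a record only when its documented uncertainty does not exceed the cell size of the environmental grid) are assumptions introduced here for illustration, not a procedure prescribed by the workshop.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Occurrence:
    """A specimen record; field names mirror Darwin Core terms."""
    decimalLatitude: Optional[float] = None
    decimalLongitude: Optional[float] = None
    coordinateUncertaintyInMeters: Optional[float] = None


def has_coordinates(rec: Occurrence) -> bool:
    """True when both latitude and longitude are present."""
    return rec.decimalLatitude is not None and rec.decimalLongitude is not None


def usable_at_resolution(rec: Occurrence, cell_size_m: float) -> bool:
    """Assumed rule of thumb: a record is usable only if it has coordinates
    and a documented uncertainty no larger than the grid cell size."""
    if not has_coordinates(rec):
        return False
    uncertainty = rec.coordinateUncertaintyInMeters
    return uncertainty is not None and uncertainty <= cell_size_m


def summarize(records: List[Occurrence], cell_size_m: float) -> Dict[str, int]:
    """Count records at each stage of the screening."""
    with_coords = [r for r in records if has_coordinates(r)]
    with_uncertainty = [
        r for r in with_coords if r.coordinateUncertaintyInMeters is not None
    ]
    usable = [r for r in records if usable_at_resolution(r, cell_size_m)]
    return {
        "total": len(records),
        "with_coordinates": len(with_coords),
        "with_uncertainty": len(with_uncertainty),
        "usable_at_resolution": len(usable),
    }


# Hypothetical records, invented for this example.
records = [
    Occurrence(41.50, 2.10, 100.0),   # precise enough for a 1 km grid
    Occurrence(41.50, 2.10, 5000.0),  # too uncertain for a 1 km grid
    Occurrence(41.50, 2.10, None),    # coordinates but no uncertainty
    Occurrence(None, None, None),     # not georeferenced
]

if __name__ == "__main__":
    print(summarize(records, cell_size_m=1000.0))
```

Under this assumed rule, a record with coordinates but no stated uncertainty is excluded rather than treated as exact, since there is no way to know whether it falls within the grid cell from which environmental values are extracted.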
In OBIS, all records are georeferenced, but likewise only 31% have coordinate uncertainty. When coordinates are present but their spatial uncertainty is not, it is not always possible to rigorously extract useful information from environmental datasets. Regrettably, it is still not unusual to find research studies using such data that have overlooked the need for coordinate uncertainty values. Both the lack of information on spatial uncertainty in georeferenced specimens and researchers' disregard of it represent an obstacle to the proper and full exploitation of collections data.

© 2020 The Authors. Diversity and Distributions published by John Wiley & Sons Ltd.

Georeferencing is a skilled, labour-intensive process which is hard to automate. It generally starts with the interpretation of the documented location information, which in most cases is hand-written on labels. Locations can be described in multiple and idiosyncratic ways, from clearly detailed and precise places to vaguely defined and sometimes large regions. However, despite its complexity, georeferencing is a well-researched process for which clear and detailed guidelines (e.g., Chapman & Wieczorek, 2006, 2020; Wieczorek et al., 2004) and information standards (Darwin Core Task Group, 2009) have long existed and are known by the collections community. Yet georeferencing is still slow, and there is a need for training, particularly among smaller collections without digitization experience.

In February 2020, we held a workshop to discuss the state of georeferencing quality of natural history collections as a critical issue for ecological research (for a detailed account of its outcomes, summarized here, see Marcer et al., 2020). 
The workshop brought together key players in the study and application of georeferencing to biodiversity collections in order to explore the reasons behind the insufficient quality of georeferenced records in data aggregators such as GBIF. To focus the discussion, the participants were given the following two questions, which were analysed and debated in four sessions over two days:

1. Why, despite the existence of quality guidelines, protocols and tools and the investment of resources in georeferencing, are georeferenced data in final public repositories, mainly GBIF, not of sufficient quality for research purposes?
2. What actions can be taken to solve this situation?

From the workshop, it became clear that this situation cannot be attributed to any single cause. In response to the first question above, the participants converged on a list of different types of causes leading to the current situation:

a. Awareness-related: the need for the collections community to better appraise the importance of quality georeferencing through the use of existing guidelines and standards;
b. Collection management systems and databases: most of them are still not fit for purpose, probably due to insufficient dialogue between software vendors and the user community;
c. Staff workload: digitization is a time-consuming process and georeferencing is often of low priority;
d. Tool friendliness: georeferencing tools still require improvement in terms of user friendliness and interoperability;
e. Geographic features: there is a lack of publicly shared, global, hierarchical, time-aware, community-vetted geographical directories (gazetteers).

After much debate and discussion, and in response to the second question above, a list of needed actions was identified and prioritized in the following categories:

a. Resource availability: it is essential to create shared gazetteers, formulate crowdsourcing and volunteer programs, and make better use of funding while searching for additional funds;
b. Centralized support: provide institutional support programs and centralized information resources for georeferencers;
c. Automated tools: there is a need to review and enhance existing software tools and develop new ones to enable bulk text processing and interpretation; a cost-effective option would be to start from existing codebases (e.g., the BioGeomancer project; Guralnick et al., 2006);
d. Better databases: collection management software and databases need to be enhanced with georeferencing capability by means of a two-way dialogue between software vendors and the user community; and
e. User stories: there is a need to compile, document and disseminate concrete working experiences from the georeferencing community which can lead to improved georeferencing practices.

Natural history collections have already had a massive impact by documenting life on Earth. With this letter, we call on the global collections and research communities to pull together and refine current procedures towards improving georeferencing and research practice. A joint effort will allow us to move forward and capitalize on the enormous wealth of information that natural history collections represent. The development of accurate and thorough georeferencing tools and protocols, and the rigorous use of the generated data in research, can be a means to integrate communities with benefits for all. Natural history collections represent a unique science infrastructure which can enable novel and larger-scale uses of the global collections resource, delivering vital research and public interpretation.

KEYWORDS
eco-evolutionary research, global biodiversity information facility, georeferencing, natural history collections, uncertainty, workshop

ACKNOWLEDGEMENTS
This work has been possible thanks to the EU COST Action CA17106: “MOBILISE. Mobilizing Data, Experts and Policies in Scientific Collections.” We would also like to acknowledge the facilities and hospitality provided by the personnel hosting the event (February 2020) at the Biological and Chemical Research Centre of the University of Warsaw in Poland.

Arnald Marcer1,2 Elspeth Haston3 Quentin Groom4 Arturo H. Ariño5 Arthur D. Chapman6 Torkild Bakken7 Paul Braun8 Mathias Dillen9 Marcus Ernst9 Agustí Escobar1 David Fichtm