Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections

We describe an approach to the automated digitisation of text on natural history specimen labels. These labels hold useful data about the specimen, including its collector, country of origin, and collection date. Our approach to extracting these data automatically takes the form of a pipeline, and we recommend state-of-the-art technologies for its component parts. Optical Character Recognition (OCR) can be used to digitise text on images of specimens, but recognising that text quickly and accurately can be a challenge. We show that OCR performance can be improved by first segmenting specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing, rather than whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing, Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images.
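The segment-then-recognise step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Segment` type, the class names such as "label", and the `ocr_labels` function are hypothetical, and the OCR engine is passed in as a callable so that the filtering logic stays independent of any particular library.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence


@dataclass
class Segment:
    """One region produced by a (hypothetical) segmentation step."""
    kind: str      # assumed class name, e.g. "label", "barcode", "specimen"
    image: object  # cropped image data; the type depends on the imaging library


def ocr_labels(
    segments: Iterable[Segment],
    ocr: Callable[[object], str],
    text_kinds: Sequence[str] = ("label",),
) -> List[str]:
    """Run OCR only on text-bearing segments and return the non-empty readings."""
    texts = []
    for seg in segments:
        if seg.kind not in text_kinds:
            # Skip non-textual regions, which could yield false positive readings.
            continue
        text = ocr(seg.image).strip()
        if text:
            texts.append(text)
    return texts
```

With Tesseract, one option would be to pass `pytesseract.image_to_string` as the `ocr` argument, assuming the pytesseract wrapper and a Tesseract installation are available; any other function that maps an image to a string would work equally well.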
