Human-Machine Information Extraction Simulator for Biological Collections

In the last decade, institutions around the world have launched initiatives to digitize biological collections (biocollections) and share their information online. The metadata on specimens’ labels is transcribed from photographs through human-centered approaches (e.g., crowdsourcing) because fully automated Information Extraction (IE) methods still generate a significant number of errors. The integration of human and machine tasks has been proposed to accelerate IE from the billions of specimens waiting to be digitized. Nevertheless, to conduct research and try new techniques, IE practitioners need to prepare sets of images, design crowdsourcing experiments, recruit volunteers, process the transcriptions, generate ground truth values, program automated methods, etc. These research resources and processes require time and effort to develop and assemble into a functional system. In this paper, we present a simulator intended to accelerate experimentation with workflows for extracting Darwin Core (DC) terms from images of specimens. The so-called HuMaIN Simulator includes the engine, the human-machine IE workflows for three DC terms, the code of the automated IE methods, crowdsourced and ground truth transcriptions of the DC terms of three biocollections, and several experiments that exemplify its potential use. The simulator adds human-in-the-loop capabilities for iterative IE and for research on optimal methods. Its practical design permits the quick definition, customization, and implementation of experimental IE scenarios.
