Multiple annotation for biodiversity: developing an annotation framework among biology, linguistics and text technology

Biodiversity information is contained in countless digitized but unprocessed scholarly texts. Although automated extraction of these data has been gaining momentum for years, innumerable text sources remain poorly accessible and require more advanced methods to extract the relevant information. To improve access to semantic biodiversity information, we launched the BIOfid project (www.biofid.de) and developed a portal for accessing the semantics of German-language biodiversity texts, mainly from the 19th and 20th centuries. To make such a portal work, however, several methods first had to be developed or adapted. In particular, text-technological information extraction methods were needed to extract the required information from the texts. Such methods draw on machine learning techniques, which in turn must be trained on annotated data. To this end, we gathered, among other resources, the BIOfid text corpus, a cooperatively built resource developed by biologists, text technologists, and linguists. A special feature of BIOfid is its multiple annotation approach, which takes both general and biology-specific classifications into account and thereby goes beyond previous, typically taxon- or ontology-driven, proper name detection. We describe the design decisions and the genuine Annotation Hub Framework underlying the BIOfid annotations and present agreement results. We introduce the tools used to create the annotations and describe how the data are used in the semantic portal. Finally, we draw some general lessons, particularly for multiple annotation projects.
