“gnparser”: a powerful parser for scientific names based on Parsing Expression Grammar

Background: Scientific names in biology act as universal links. They allow us to cross-reference information about organisms globally. However, variations in the spelling of scientific names greatly diminish their ability to interconnect data. Such variations may include abbreviations, annotations, misspellings, etc. Authorship is part of a scientific name and may also differ significantly. To match all possible variations of a name, we need to divide each name into its elements and classify each element according to its role. We refer to this as ‘parsing’ the name. Parsing categorizes a name’s elements into those that are stable and those that are prone to change. Names are matched first by grouping them according to their stable elements. Matches are then refined by examining their varying elements. This two-stage process dramatically improves the number and quality of matches. It is especially useful for automatic data exchange within the context of “Big Data” in biology.

Results: We introduce Global Names Parser (gnparser), a tool for parsing scientific names. It runs on the Java Virtual Machine and is written in Scala. It is based on a Parsing Expression Grammar. The parser can be applied to scientific names of any complexity. It assigns a semantic meaning (such as genus name, species epithet, rank, year of publication, authorship, annotations, etc.) to all elements of a name. It is able to work with nested structures, as in the names of hybrids. gnparser performs with ≈99% accuracy and processes 30 million name-strings/hour per CPU thread. The gnparser library is compatible with Scala, Java, R, Jython, and JRuby. The parser can be used as a command line application, a socket server, a web app, or a RESTful HTTP service. It is released under an open source MIT license.

Conclusions: Global Names Parser (gnparser) is a fast, high-precision tool for biodiversity informaticians and biologists working with large numbers of scientific names. It can replace expensive and error-prone manual parsing and standardization of scientific names in many situations, and can quickly enhance the interoperability of distributed biological information.
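The two-stage matching described in the Background can be illustrated with a minimal Python sketch. It is not gnparser and does not use a Parsing Expression Grammar: it relies on a deliberately simple regular expression that only handles plain "Genus epithet Authorship, Year" strings. The function names (`parse_name`, `group_by_canonical`, `refine`), the pattern, and the compatibility rules are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Toy pattern (an assumption, not gnparser's grammar): a capitalized genus,
# a lowercase species epithet, and everything else treated as authorship.
NAME_RE = re.compile(r"^\s*(?P<genus>[A-Z][a-z]+)\s+(?P<epithet>[a-z-]+)\s*(?P<rest>.*)$")
YEAR_RE = re.compile(r"\b(\d{4})\b")


def parse_name(name_string):
    """Split a name-string into stable (canonical) and varying (authorship, year) parts."""
    m = NAME_RE.match(name_string)
    if m is None:
        return None
    rest = m.group("rest").strip()
    year_match = YEAR_RE.search(rest)
    year = year_match.group(1) if year_match else None
    # Authorship is what remains after dropping the year, parentheses, and commas.
    authorship = re.sub(r"[(),]", " ", YEAR_RE.sub("", rest))
    authorship = " ".join(authorship.split()) or None
    return {
        "canonical": f"{m.group('genus')} {m.group('epithet')}",
        "authorship": authorship,
        "year": year,
    }


def group_by_canonical(name_strings):
    """Stage 1: group name-strings by their stable elements."""
    groups = defaultdict(list)
    for s in name_strings:
        parsed = parse_name(s)
        if parsed is not None:
            groups[parsed["canonical"]].append((s, parsed))
    return groups


def authors_compatible(a, b):
    """Crude rule: missing authorship matches anything; 'L.' matches 'Linnaeus'."""
    if a is None or b is None:
        return True
    a, b = a.rstrip(".").lower(), b.rstrip(".").lower()
    return a.startswith(b) or b.startswith(a)


def refine(query, candidates):
    """Stage 2: drop candidates whose varying elements contradict the query."""
    kept = []
    for s, p in candidates:
        years_ok = query["year"] is None or p["year"] is None or query["year"] == p["year"]
        if years_ok and authors_compatible(query["authorship"], p["authorship"]):
            kept.append(s)
    return kept


names = [
    "Homo sapiens Linnaeus, 1758",
    "Homo sapiens L.",
    "Homo sapiens",
    "Homo sapiens Smith, 1900",
    "Parus major Linnaeus, 1758",
]
query = parse_name("Homo sapiens (Linnaeus 1758)")
groups = group_by_canonical(names)
matches = refine(query, groups[query["canonical"]])
print(matches)
# ['Homo sapiens Linnaeus, 1758', 'Homo sapiens L.', 'Homo sapiens']
```

Grouping on the canonical form first keeps the comparison of authorship and year confined to a small set of candidates. A real parser would also have to handle ranks, hybrids, annotations, and nested authorship, which this sketch ignores.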
