NetiNeti: discovery of scientific names from text using machine learning methods

Background: A scientific name for an organism can be associated with almost all biological data. Name identification is an important step in many text mining tasks aiming to extract useful information from biological, biomedical and biodiversity text sources. A scientific name acts as an important metadata element to link biological information.

Results: We present NetiNeti (Name Extraction from Textual Information-Name Extraction for Taxonomic Indexing), a machine learning based approach for recognition of scientific names including the discovery of new species names from text that will also handle misspellings, OCR errors and other variations in names. The system generates candidate names using rules for scientific names and applies probabilistic machine learning methods to classify names based on structural features of candidate names and features derived from their contexts. NetiNeti can also disambiguate scientific names from other names using the contextual information. We evaluated NetiNeti on legacy biodiversity texts and biomedical literature (MEDLINE). NetiNeti performs better (precision = 98.9% and recall = 70.5%) compared to a popular dictionary based approach (precision = 97.5% and recall = 54.3%) on a 600-page biodiversity book that was manually marked by an annotator. On a small set of PubMed Central's full text articles annotated with scientific names, the precision and recall values are 98.5% and 96.2% respectively. NetiNeti found more than 190,000 unique binomial and trinomial names in more than 1,880,000 PubMed records when used on the full MEDLINE database. NetiNeti also successfully identifies almost all of the new species names mentioned within web pages.

Conclusions: We present NetiNeti, a machine learning based approach for identification and discovery of scientific names. The system implementing the approach can be accessed at http://namefinding.ubio.org
