Ambiguity and variability of database and software names in bioinformatics

Background: There are numerous options available to achieve various tasks in bioinformatics, but until recently, there were no tools that could systematically identify mentions of databases and tools within the literature. In this paper we explore the variability and ambiguity of database and software name mentions and compare dictionary and machine learning approaches to their identification.

Results: Through the development and analysis of a corpus of 60 full-text documents manually annotated at the mention level, we report high variability and ambiguity in database and software mentions. On a test set of 25 full-text documents, a baseline dictionary look-up achieved an F-score of 46%, highlighting not only variability and ambiguity but also the extensive number of new resources introduced. A machine learning approach achieved an F-score of 63% (with precision of 74%) and 70% (with precision of 83%) for strict and lenient matching respectively. We characterise the issues with various mention types and propose potential ways of capturing additional database and software mentions in the literature.

Conclusions: Our analyses show that identification of mentions of databases and tools is a challenging task that cannot be achieved by relying on current manually-curated resource repositories. Although machine learning shows improvement and promise (primarily in precision), more contextual information needs to be taken into account to achieve a good degree of accuracy.
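The abstract reports F-score and precision for the machine learning approach but not recall. As a rough illustration of why the improvement is described as primarily in precision, the recall implied by those figures can be recovered. This is a minimal sketch, assuming the reported F-score is the balanced F1 (the harmonic mean of precision and recall); the function name `implied_recall` is hypothetical, and because the published figures are rounded, the results are approximate.

```python
def implied_recall(f_score: float, precision: float) -> float:
    """Solve F = 2PR / (P + R) for R, assuming a balanced F1 score."""
    return f_score * precision / (2 * precision - f_score)


# Figures taken from the abstract (rounded percentages, as fractions).
strict_recall = implied_recall(f_score=0.63, precision=0.74)
lenient_recall = implied_recall(f_score=0.70, precision=0.83)

print(f"strict matching:  implied recall ~ {strict_recall:.2f}")
print(f"lenient matching: implied recall ~ {lenient_recall:.2f}")
```

Under this assumption, the implied recall is roughly 0.55 for strict matching and 0.61 for lenient matching, noticeably below the corresponding precision in both cases.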
