Which species is it? Species-driven gene name disambiguation using random walks over a mixture of adjacency matrices

MOTIVATION: The scientific literature contains a wealth of information about biological systems. Manual curation cannot scale to extract this information, because the number of papers being published keeps increasing. The development and application of text mining technologies has been proposed as a way of dealing with this problem. However, the inter-species ambiguity of the genomic nomenclature makes mapping gene mentions identified in text to their corresponding Entrez gene identifiers an extremely difficult task. We propose a novel method, which transforms a MEDLINE record into a mixture of adjacency matrices; by performing a random walk over the resulting graph, we can perform multi-class supervised classification, allowing the assignment of taxonomy identifiers to individual gene mentions. The ability to achieve good performance at this task has a direct impact on the performance of normalizing gene mentions to Entrez gene identifiers. Such graph mixtures add flexibility and allow us to generate probabilistic classification schemes that naturally reflect the uncertainties inherent even in literature-derived data.

RESULTS: Our method performs well in terms of both micro- and macro-averaged performance, achieving micro-F(1) of 0.76 and macro-F(1) of 0.36 on the publicly available DECA corpus. On a re-curated version of the DECA corpus, our method achieves 0.88 micro-F(1) and 0.51 macro-F(1). Our method improves over standard classification techniques [such as support vector machines (SVMs)] in a number of ways: flexibility, interpretability and resistance to the effects of class bias in the training data. Good performance is achieved without the need for computationally expensive parse tree generation or 'bag of words' classification.
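The general idea of a random walk over a mixture of adjacency matrices can be illustrated with a small sketch. The abstract does not specify how the graphs are built, how they are weighted, or which walk variant is used, so everything below is an assumption made for illustration: the two toy graphs (`proximity` and `lexicon`), the mixture weights, the restart probability, and the function names are all hypothetical, and a random walk with restart is used as one common choice rather than as the authors' method.

```python
# Illustrative sketch only: toy graphs, weights and restart value are assumptions.

def mix_adjacency(matrices, weights):
    """Return the weighted sum of same-sized adjacency matrices."""
    n = len(matrices[0])
    mixed = [[0.0] * n for _ in range(n)]
    for A, w in zip(matrices, weights):
        for i in range(n):
            for j in range(n):
                mixed[i][j] += w * A[i][j]
    return mixed

def row_normalize(A):
    """Turn an adjacency matrix into a row-stochastic transition matrix."""
    P = []
    for row in A:
        total = sum(row)
        P.append([x / total for x in row] if total > 0 else [0.0] * len(row))
    return P

def random_walk_with_restart(P, start, restart=0.3, steps=50):
    """Iterate p <- (1 - restart) * p P + restart * e_start."""
    n = len(P)
    p = [0.0] * n
    p[start] = 1.0
    for _ in range(steps):
        nxt = [0.0] * n
        for i in range(n):
            for j in range(n):
                nxt[j] += (1 - restart) * p[i] * P[i][j]
        nxt[start] += restart
        p = nxt
    return p

def species_scores(p, species_nodes):
    """Renormalize visit probabilities over the species nodes only."""
    total = sum(p[i] for i in species_nodes.values())
    return {tax_id: p[i] / total for tax_id, i in species_nodes.items()}

# Hypothetical nodes: 0 = gene mention, 1 = token "mouse", 2 = token "patient",
# 3 = taxonomy ID 10090 (mouse), 4 = taxonomy ID 9606 (human).
proximity = [  # assumed graph: how close each token is to the gene mention
    [0, 2, 1, 0, 0],
    [2, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
lexicon = [  # assumed graph: which tokens indicate which species
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
]

P = row_normalize(mix_adjacency([proximity, lexicon], [0.5, 0.5]))
p = random_walk_with_restart(P, start=0)
scores = species_scores(p, {"10090": 3, "9606": 4})
print(scores)
```

In this toy record the walk started at the gene mention reaches the mouse taxonomy node more often than the human one, because the "mouse" token is more strongly linked to the mention. The renormalized scores form a probability distribution over candidate species, which is the kind of probabilistic output the abstract describes.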
