eGIFT: Mining Gene Information from the Literature

Background: With the biomedical literature continually expanding, searching PubMed for information about specific genes becomes increasingly difficult. Not only can thousands of results be returned, but gene name ambiguity leads to many irrelevant hits. As a result, it is difficult for life scientists and gene curators to rapidly get an overall picture of a specific gene from documents that mention its names and synonyms.

Results: In this paper, we present eGIFT (http://biotm.cis.udel.edu/eGIFT), a web-based tool that associates informative terms, called iTerms, and the sentences containing them, with genes. To associate iTerms with a gene, eGIFT ranks candidate terms by a score that compares a term's frequency of occurrence in the gene's literature to its frequency of occurrence in documents about genes in general. To retrieve a gene's documents (Medline abstracts), eGIFT considers all gene names, aliases, and synonyms. Since many gene names are ambiguous, eGIFT applies a disambiguation step to remove matches that do not correspond to the gene in question. An additional filtering step retains only those abstracts that focus on the gene rather than mention it in passing. eGIFT's information for a gene is pre-computed, and users can search for genes by name or by EntrezGene identifier. iTerms are grouped into categories to facilitate quick inspection. eGIFT also links each iTerm to the sentences mentioning it, so that users can see the relation between the iTerm and the gene. We evaluated the precision and recall of eGIFT's iTerms for 40 genes; between 88% and 94% of the iTerms were marked as salient by our evaluators, and 94% of the UniProtKB keywords for these genes were also identified by eGIFT as iTerms.

Conclusions: Our evaluations suggest that iTerms capture highly relevant aspects of genes. Furthermore, by showing sentences containing these terms, eGIFT can provide a quick description of a specific gene. eGIFT helps not only life scientists surveying the results of high-throughput experiments, but also annotators looking for articles that describe gene aspects and functions.
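
The term-scoring idea described in the abstract above, comparing how often a term occurs in a gene's literature with how often it occurs in documents about genes in general, can be illustrated with a minimal sketch. The abstract does not give eGIFT's actual scoring formula, so the smoothed log-ratio of document frequencies used here is an assumption chosen for illustration, as are the function names (`document_frequencies`, `rank_terms`), the whitespace tokenization, and the toy documents.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    counts = Counter()
    for doc in docs:
        # Naive whitespace tokenization; a set so each doc counts a term once.
        counts.update(set(doc.lower().split()))
    return counts


def rank_terms(gene_docs, background_docs, smoothing=1.0):
    """Rank terms by a smoothed log-ratio of document frequencies.

    NOTE: this score is an illustrative assumption, not eGIFT's formula.
    """
    gene_df = document_frequencies(gene_docs)
    bg_df = document_frequencies(background_docs)
    n_gene, n_bg = len(gene_docs), len(background_docs)

    scores = {}
    for term, count in gene_df.items():
        # Add-one style smoothing avoids division by zero for unseen terms.
        gene_rate = (count + smoothing) / (n_gene + 2 * smoothing)
        bg_rate = (bg_df.get(term, 0) + smoothing) / (n_bg + 2 * smoothing)
        scores[term] = math.log(gene_rate / bg_rate)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy corpora: abstracts about one gene vs. genes in general.
gene_docs = [
    "kinase phosphorylates substrate gene",
    "gene kinase activity",
    "kinase gene expression",
]
background_docs = [
    "gene expression in cell",
    "gene mutation disease",
    "protein gene binding",
    "gene signaling pathway",
]

ranked = rank_terms(gene_docs, background_docs)
for term, score in ranked:
    print(f"{term}\t{score:.3f}")
```

In this toy example, "kinase" ranks first because it appears in every gene-specific document and in none of the background documents, while the ubiquitous word "gene" receives a negative score. A real pipeline would also need the name disambiguation and abstract filtering steps that the abstract describes, which this sketch omits.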