Gene symbol disambiguation using knowledge-based profiles

MOTIVATION: The ambiguity of biomedical entities, particularly of gene symbols, is a major challenge for text-mining systems in the biomedical domain. Existing knowledge sources, such as Entrez Gene and the MEDLINE database, contain information about the characteristics of individual genes that can be used to disambiguate gene symbols.

RESULTS: For each gene, we create a profile from several types of information, extracted automatically from related MEDLINE abstracts and from readily available annotated knowledge sources. We apply the gene profiles to the disambiguation task with an information retrieval method that scores the similarity between the context in which the ambiguous symbol is mentioned and each candidate gene profile, then ranks the candidates by that score. The gene whose profile has the highest similarity score is chosen as the correct sense. We evaluated the method on three automatically generated test sets, one each for mouse, fly and yeast. The method achieved its highest precision of 93.9% on the mouse set, 77.8% on the fly set and 89.5% on the yeast set.

AVAILABILITY: The test data sets and disambiguation programs are available at http://www.dbmi.columbia.edu/~hux7002/gsd2006
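The ranking step described under RESULTS can be illustrated with a minimal sketch. The abstract does not state which term weighting or similarity measure is used, so the TF-IDF weighting and cosine similarity below are assumptions chosen for illustration, not the authors' implementation. The gene identifiers, profile texts and function names are hypothetical toy values.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(profiles):
    """Smoothed inverse document frequency computed over the candidate profiles."""
    n = len(profiles)
    df = Counter()
    for tokens in profiles.values():
        df.update(set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def weight(tokens, idf):
    """TF-IDF vector as a dict; terms that occur in no profile are dropped."""
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[term] for term, w in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(context, profile_texts):
    """Return (gene, score) pairs sorted from most to least similar to the context."""
    profiles = {gene: tokenize(text) for gene, text in profile_texts.items()}
    idf = idf_table(profiles)
    ctx_vec = weight(tokenize(context), idf)
    scores = {gene: cosine(ctx_vec, weight(tokens, idf)) for gene, tokens in profiles.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical toy profiles for two genes that share one ambiguous symbol.
toy_profiles = {
    "gene_A": "catalase enzyme hydrogen peroxide oxidative stress liver",
    "gene_B": "chloramphenicol acetyltransferase reporter plasmid bacterial resistance",
}
context = "expression of cat reduced hydrogen peroxide levels under oxidative stress"

ranked = rank_candidates(context, toy_profiles)
best_gene = ranked[0][0]  # the top-ranked profile is taken as the sense
print(ranked)
```

In this toy case the context shares four terms with the first profile and none with the second, so the first candidate is ranked on top. A real profile would combine several information types per gene, as the abstract describes, rather than a single bag of words.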
