Thesaurus-based disambiguation of gene symbols

BackgroundMassive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck.ResultsWe developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set.ConclusionThe ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not a gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications.

[1]  Michael Y. Galperin The Molecular Biology Database Collection: 2005 update , 2004, Nucleic Acids Res..

[2]  Tapio Salakoski,et al.  New Techniques for Disambiguation in Natural Language and Their Application to Biological Text , 2004, J. Mach. Learn. Res..

[3]  Erik M. van Mulligen,et al.  Ambiguity of Human Gene Symbols in LocusLink and MEDLINE: Creating an Inventory and a Disambiguation Test Collection , 2003, AMIA.

[4]  Donna R. Maglott,et al.  RefSeq and LocusLink: NCBI gene-centered resources , 2001, Nucleic Acids Res..

[5]  K. Bretonnel Cohen,et al.  Contrast and variability in gene names , 2002, ACL Workshop on Natural Language Processing in the Biomedical Domain.

[6]  S. Povey,et al.  The changing challenges of nomenclature , 1999, Cytogenetic and Genome Research.

[7]  T. Jenssen,et al.  A literature network of human genes for high-throughput analysis of gene expression , 2001, Nature Genetics.

[8]  Obstacles of nomenclature , 1997, Nature.

[9]  Erik M. van Mulligen,et al.  Facilitating networks of information , 2000, AMIA.

[10]  William R. Hersh,et al.  Information Retrieval: A Health and Biomedical Perspective , 2002 .

[11]  C. Blaschke,et al.  The potential use of SUISEKI as a protein interaction discovery tool. , 2001, Genome informatics. International Conference on Genome Informatics.

[12]  Martijn J. Schuemie,et al.  Distribution of information in biomedical abstracts and full-text publications , 2004, Bioinform..

[13]  angesichts der Corona-Pandemie,et al.  UPDATE , 1973, The Lancet.

[14]  Hongfang Liu,et al.  Gene name ambiguity of eukaryotic nomenclatures , 2005, Bioinform..

[15]  Hongfang Liu,et al.  Research Paper: A Multi-aspect Comparison Study of Supervised Word Sense Disambiguation , 2004, J. Am. Medical Informatics Assoc..

[16]  Vasileios Hatzivassiloglou,et al.  Disambiguating proteins, genes, and RNA in text: a machine learning approach , 2001, ISMB.

[17]  A. Valencia,et al.  A gene network for navigating the literature , 2004, Nature Genetics.

[18]  Michael Krauthammer,et al.  Term identification in the biomedical literature , 2004, J. Biomed. Informatics.

[19]  Yorick Wilks,et al.  The Interaction of Knowledge Sources in Word Sense Disambiguation , 2001, CL.

[20]  Hagit Shatkay,et al.  Genes, Themes, and Microarrays: Using Information Retrieval for Large-Scale Gene Analysis , 2000, ISMB.

[21]  Paul Buitelaar,et al.  Unsupervised Monolingual and Bilingual Word-Sense Disambiguation of Medical Documents using UMLS , 2003, BioNLP@ACL.

[22]  Hongfang Liu,et al.  Disambiguating Ambiguous Biomedical Terms in Biomedical Narrative Text: An Unsupervised Method , 2001, J. Biomed. Informatics.

[23]  Marti A. Hearst,et al.  A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text , 2002, Pacific Symposium on Biocomputing.

[24]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[25]  P Bork,et al.  Automated extraction of information in molecular biology , 2000, FEBS letters.

[26]  John G. Cleary,et al.  AZuRE, a scalable system for automated term disambiguation of gene and protein names , 2004, Proceedings. 2004 IEEE Computational Systems Bioinformatics Conference, 2004. CSB 2004..

[27]  Nancy Ide,et al.  Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art , 1998, Comput. Linguistics.

[28]  Jonathan D. Wren,et al.  Knowledge discovery by automated identification and ranking of implicit relationships , 2004, Bioinform..

[29]  Hagit Shatkay,et al.  Mining the Biomedical Literature in the Genomic Era: An Overview , 2003, J. Comput. Biol..

[30]  Hongfang Liu,et al.  Pacific Symposium on Biocomputing 9:238-249(2004) BIOLOGICAL NOMENCLATURES: A SOURCE OF LEXICAL KNOWLEDGE AND AMBIGUITY , 2022 .

[31]  Michael Krauthammer,et al.  GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles , 2001, ISMB.

[32]  Michael Gribskov,et al.  Use of keyword hierarchies to interpret gene expression patterns , 2001, Bioinform..

[33]  H. Lowe,et al.  Understanding and using the medical subject headings (MeSH) vocabulary to perform literature searches. , 1994, JAMA.

[34]  Hongfang Liu,et al.  Research Paper: Automatic Resolution of Ambiguous Terms Based on Machine Learning and Conceptual Relations in the UMLS , 2002, J. Am. Medical Informatics Assoc..

[35]  Mikhail V. Blagosklonny,et al.  Conceptual biology: Unearthing the gems , 2002, Nature.