Ambiguity of Human Gene Symbols in LocusLink and MEDLINE: Creating an Inventory and a Disambiguation Test Collection

New genes are discovered almost daily, and each one needs a name. Although guidelines for gene nomenclature exist, the naming process is highly creative. Human genes are usually given a gene symbol and a longer, more descriptive name; the symbol is very often an abbreviation of the long form. Abbreviations in biomedical language are highly ambiguous: one gene symbol often refers to more than one gene. Using an existing abbreviation expansion algorithm, we explore how human gene symbols derived from LocusLink are used in MEDLINE. Just over 40% of these symbols occur in MEDLINE; however, many of these occurrences are not related to genes. In the process of making this inventory, a disambiguation test collection is constructed automatically.
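
The idea of an ambiguity inventory and an automatically labeled test collection can be illustrated with a minimal sketch. It is not the abbreviation expansion algorithm used in the study: it assumes a simple exact-match rule in which an abstract that spells out "long form (SYMBOL)" is labeled with that sense. The function names, the records, and the abstracts are all hypothetical and invented for illustration; none are taken from LocusLink or MEDLINE.

```python
from collections import defaultdict


def build_inventory(records):
    """Group long forms by symbol; records are (symbol, long_form) pairs."""
    inventory = defaultdict(set)
    for symbol, long_form in records:
        inventory[symbol].add(long_form)
    return dict(inventory)


def ambiguous(inventory):
    """Keep only the symbols that map to more than one long form."""
    return {s: forms for s, forms in inventory.items() if len(forms) > 1}


def label_abstracts(abstracts, inventory):
    """Label abstracts that define a symbol explicitly as 'long form (SYMBOL)'.

    Matching is case-insensitive and exact; this is an assumed
    simplification, not the expansion algorithm from the paper.
    """
    labeled = []
    for text in abstracts:
        lowered = text.lower()
        for symbol, forms in inventory.items():
            for form in sorted(forms):
                if f"{form.lower()} ({symbol.lower()})" in lowered:
                    labeled.append((symbol, form, text))
    return labeled


# Toy data, invented for illustration only.
records = [
    ("PSA", "prostate specific antigen"),
    ("PSA", "psoriatic arthritis"),
    ("BRCA1", "breast cancer 1"),
]
abstracts = [
    "Serum prostate specific antigen (PSA) was measured.",
    "Patients with psoriatic arthritis (PsA) were enrolled.",
    "PSA levels rose.",
]

inventory = build_inventory(records)
test_collection = label_abstracts(abstracts, inventory)
for symbol, sense, text in test_collection:
    print(symbol, "|", sense, "|", text)
```

In this sketch the third abstract stays unlabeled because it never spells out a long form; such occurrences are the ones a disambiguation method would later have to resolve.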
