Combining multiple evidence for gene symbol disambiguation

Gene names and symbols are important biomedical entities, but they are highly ambiguous. This ambiguity degrades the performance of both information extraction and information retrieval systems in the biomedical domain. Existing knowledge sources contain different types of information about genes that could be used to disambiguate gene symbols. In this paper, we applied an information retrieval (IR) based method to human gene symbol disambiguation and studied different methods for combining the various types of information available from these knowledge sources. Results showed that combining evidence usually improved performance. The combination method that used coefficients obtained from a logistic regression model reached the highest precision, 92.2%, on a test set of ambiguous human gene symbols.
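The general idea of combining evidence with weighted scores can be illustrated with a minimal sketch. It is not the paper's implementation: the evidence types (`words`, `concepts`), the gene identifiers, the toy profiles, and the coefficient values are all hypothetical, and the coefficients are hard-coded here where in practice they would be taken from a fitted logistic regression model. Each candidate gene has one profile per evidence type, the context of the ambiguous symbol is compared to each profile by cosine similarity, and the per-type similarities are summed with the coefficients as weights.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-features Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def combined_score(context, profile, coefficients):
    """Weighted sum of per-evidence-type similarities."""
    return sum(
        coef * cosine(context.get(src, Counter()), profile.get(src, Counter()))
        for src, coef in coefficients.items()
    )


def disambiguate(context, candidates, coefficients):
    """Return the candidate gene whose profiles best match the context."""
    return max(
        candidates,
        key=lambda gene: combined_score(context, candidates[gene], coefficients),
    )


# Hypothetical weights; assumed to come from a logistic regression fit elsewhere.
coefficients = {"words": 1.2, "concepts": 2.5}

# Hypothetical gene profiles built from knowledge sources.
candidates = {
    "GENE_A": {
        "words": Counter("kinase signaling cell growth".split()),
        "concepts": Counter(["C_kinase"]),
    },
    "GENE_B": {
        "words": Counter("membrane transport ion channel".split()),
        "concepts": Counter(["C_channel"]),
    },
}

# Features extracted from the document that mentions the ambiguous symbol.
context = {
    "words": Counter("the kinase regulates cell growth".split()),
    "concepts": Counter(["C_kinase"]),
}

best = disambiguate(context, candidates, coefficients)
```

Keeping one profile per evidence type makes it possible to evaluate each type on its own and then compare alternative combination schemes by swapping only the `coefficients` mapping.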
