Learning string similarity measures for gene/protein name dictionary look-up using logistic regression

MOTIVATION One of the bottlenecks of biomedical data integration is variation of terms. Exact string matching often fails to associate a name with its biological concept, i.e. ID or accession number in the database, due to seemingly small differences of names. Soft string matching potentially enables us to find the relevant ID by considering the similarity between the names. However, the accuracy of soft matching highly depends on the similarity measure employed. RESULTS We used logistic regression for learning a string similarity measure from a dictionary. Experiments using several large-scale gene/protein name dictionaries showed that the logistic regression-based similarity measure outperforms existing similarity measures in dictionary look-up tasks. AVAILABILITY A dictionary look-up system using the similarity measures described in this article is available at http://text0.mib.man.ac.uk/software/mldic/.

[1]  Fernando Pereira,et al.  Automatically annotating documents with normalized gene lists , 2005, BMC Bioinformatics.

[2]  Lawrence H. Smith,et al.  Identification of related gene/protein names based on an HMM of name variations , 2004, Comput. Biol. Chem..

[3]  William E. Winkler,et al.  The State of Record Linkage and Current Research Problems , 1999 .

[4]  Alexander A. Morgan,et al.  Gene name identification and normalization using a model organism database , 2004, J. Biomed. Informatics.

[5]  Sugato Basu,et al.  Adaptive product normalization: using online learning for record linkage in comparison shopping , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[6]  K. Bretonnel Cohen,et al.  Contrast and variability in gene names , 2002, ACL Workshop on Natural Language Processing in the Biomedical Domain.

[7]  Alfonso Valencia,et al.  Implementing the iHOP concept for navigation of biomedical literature , 2005, ECCB/JBI.

[8]  C. Friedman,et al.  Using BLAST for identifying gene and protein names in journal articles. , 2000, Gene.

[9]  Jun'ichi Tsujii,et al.  Improving the performance of dictionary-based approaches in protein name recognition , 2004, J. Biomed. Informatics.

[10]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[11]  Peter N. Yianilos,et al.  Learning String-Edit Distance , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[12]  Hongfang Liu,et al.  BioThesaurus: a web-based thesaurus of protein and gene names , 2006, Bioinform..

[13]  Sophia Ananiadou,et al.  MaSTerClass: a case-based reasoning system for the classification of biomedical terms , 2005, Bioinform..

[14]  Daniel Hanisch,et al.  ProMiner: rule-based protein and gene entity recognition , 2005, BMC Bioinformatics.

[15]  Jun'ichi Tsujii,et al.  Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases , 2006, ACL.

[16]  James Pustejovsky,et al.  Adaptive String Similarity Metrics for Biomedical Reference Resolution , 2005, LBLODMBS@IDMB.

[17]  K. Cohen,et al.  Overview of BioCreative II gene normalization , 2008, Genome Biology.

[18]  William W. Cohen,et al.  A graph-search framework for associating gene identifiers with documents , 2006, BMC Bioinformatics.

[19]  Andrew McCallum,et al.  A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance , 2005, UAI.

[20]  Yang Jin,et al.  Human Gene Name Normalization using Text Matching with Automatically Extracted Synonym Dictionaries , 2006, BioNLP@NAACL-HLT.

[21]  William W. Cohen,et al.  Learning to match and cluster large high-dimensional data sets for data integration , 2002, KDD.

[22]  Toshihisa Takagi,et al.  Kinase pathway database: an integrated protein-kinase and NLP-based protein-interaction resource. , 2003, Genome research.