Discriminative Optimization of String Similarity and Its Application to Biomedical Abbreviation Clustering

Many string similarity measures have been developed to deal with the variety of expressions in natural language texts. With so many measures available, the choice of a measure and its parameters should be made to maximize performance on the task at hand. In a preliminary experiment to find the best measure and parameters for clustering terms, undertaken to improve our abbreviation dictionary for the life sciences, we found that the character sequences of chemical names have different characteristics from those of other terms. Based on this observation, we experimented with four string similarity measures to test the hypothesis that "chemical names have a different morphology, and thus their similarity should be computed differently from that of other terms." The experimental results show that edit distance performs best for chemical names, and that applying different string similarity measures to chemical and non-chemical names may be a simple but effective way to improve the performance of term clustering.
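The idea of choosing the similarity measure per term class can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the measure used for non-chemical terms, so the character-trigram Jaccard similarity here is an assumption, and `looks_chemical` is a hypothetical placeholder heuristic standing in for whatever chemical-name detector is actually used.

```python
from typing import Callable, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def trigrams(s: str) -> Set[str]:
    """Set of character trigrams; strings shorter than 3 yield themselves."""
    if len(s) < 3:
        return {s}
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of character trigram sets (assumed measure)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def looks_chemical(term: str) -> bool:
    """Hypothetical heuristic: digits or bracket/comma punctuation."""
    return any(ch.isdigit() or ch in "()[]," for ch in term)


def similarity(a: str, b: str,
               is_chemical: Callable[[str], bool] = looks_chemical) -> float:
    """Use edit distance for chemical names, trigram overlap otherwise."""
    if is_chemical(a) and is_chemical(b):
        return edit_similarity(a, b)
    return trigram_similarity(a, b)


if __name__ == "__main__":
    print(similarity("1,2-dichloroethane", "1,1-dichloroethane"))
    print(similarity("tumor necrosis factor", "tumour necrosis factor"))
```

A clustering step would then compare each score against a threshold, which, like the choice of measure, would need to be tuned separately for each class of terms.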
