String Edit Analysis for Merging Databases

The first step prior to data mining is often to merge databases from different sources. Entries in these databases, or descriptions retrieved using information extraction, may use significantly different vocabularies, so one often needs to determine whether similar descriptions refer to the same item or to different items (e.g., people or goods). String edit distance is an elegant way of defining the degree of similarity between entries and can be efficiently computed using dynamic programming (Ristad and Yianilos, 1977). However, in order to achieve reasonable accuracy, most real problems require the use of extended sets of edit rules with associated costs that are tuned specifically to each data set. We present a flexible approach to string edit distance, which can be automatically tuned to different data sets and can use synonym dictionaries. Dynamic programming is used to calculate the edit distance between a pair of strings based on a set of string edit rules, including a new edit rule that allows words and phrases to be deleted or substituted. A genetic algorithm is used to learn the cost of each edit rule from a small set of labeled training data. Deleting contentless words like "method" and substituting synonyms such as "ibuprofen" for "Motrin" significantly increase the algorithm’s accuracy (from 80% to 90% on a difficult sample medical data set) when costs are correctly tuned. This string edit-based matching tool is easily adapted to a variety of cases in which one needs to recognize which text strings from different information sources refer to the same item, such as a person, address, medical procedure, or product.
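
To make the extended rule set concrete, the following is a minimal sketch of a dynamic-programming edit distance with two extra rules: deleting a listed phrase and substituting one synonym for another. It is an illustration under stated assumptions, not the implementation described above. The function name `edit_distance`, the cost keys, and the cost values are hypothetical; the costs are fixed by hand here rather than learned by a genetic algorithm, and phrases are matched as plain substrings without checking word boundaries.

```python
def edit_distance(a, b, costs, stop_phrases=(), synonyms=()):
    """Edit distance between a and b under an extended rule set.

    costs        : dict with keys "insert", "delete", "substitute",
                   "phrase_delete", "synonym" (hypothetical names).
    stop_phrases : phrases that may be dropped from either string.
    synonyms     : pairs (x, y) that may be exchanged for each other.
    """
    a, b = a.lower(), b.lower()
    phrases = [p.lower() for p in stop_phrases if p]
    pairs = [(x.lower(), y.lower()) for x, y in synonyms if x and y]
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] = cheapest cost of turning a[:i] into b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0

    def relax(i, j, value):
        if value < d[i][j]:
            d[i][j] = value

    # Forward DP: every rule moves to a cell with larger i and/or j,
    # so each cell is final by the time it is visited in row-major order.
    for i in range(n + 1):
        for j in range(m + 1):
            cur = d[i][j]
            if cur == inf:
                continue
            if i < n:  # delete one character of a
                relax(i + 1, j, cur + costs["delete"])
            if j < m:  # insert one character of b
                relax(i, j + 1, cur + costs["insert"])
            if i < n and j < m:  # match (free) or substitute
                step = 0 if a[i] == b[j] else costs["substitute"]
                relax(i + 1, j + 1, cur + step)
            for p in phrases:  # drop a whole phrase from either side
                if a.startswith(p, i):
                    relax(i + len(p), j, cur + costs["phrase_delete"])
                if b.startswith(p, j):
                    relax(i, j + len(p), cur + costs["phrase_delete"])
            for x, y in pairs:  # exchange synonyms in either direction
                for s, t in ((x, y), (y, x)):
                    if a.startswith(s, i) and b.startswith(t, j):
                        relax(i + len(s), j + len(t), cur + costs["synonym"])
    return d[n][m]


# Hand-picked costs for illustration only.
COSTS = {"insert": 1, "delete": 1, "substitute": 1,
         "phrase_delete": 0.1, "synonym": 0.1}

plain = edit_distance("biopsy method", "biopsy", COSTS)
tuned = edit_distance("biopsy method", "biopsy", COSTS,
                      stop_phrases=[" method"])
drug = edit_distance("Motrin 200mg", "ibuprofen 200mg", COSTS,
                     synonyms=[("ibuprofen", "Motrin")])
```

With no phrase or synonym rules supplied, the function reduces to ordinary character-level edit distance. Supplying the rules lets "biopsy method" and "biopsy", or "Motrin 200mg" and "ibuprofen 200mg", come out much closer than their character-level distance suggests, which is the effect the tuned costs are meant to capture.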