A Threshold-Based Combination of String and Semantic Similarity Measures for Record Linkage
Because integrated data carry richer information, integrating different data sources is a key step in most data warehousing and data mining projects. One of the principal challenges in integrating databases is duplication: the same entity may appear in different formats in different databases, so combining these databases produces duplicate records. Record linkage is a technique for detecting and matching the duplicate records generated during data integration. A variety of record linkage models with different steps have been developed to detect such duplicates, and string similarity measures are widely used in these studies to compare record pairs. However, in addition to string similarity, considering the semantic relatedness between two records can also be beneficial for detecting duplicates. Existing record linkage models do not take this into account.
To determine how much semantic similarity improves the detection of duplicate records, this study proposes a similarity measure that combines string and semantic similarity measures. The combination uses a threshold-based method that considers the semantic similarity of each field of the dataset; the threshold determines the influence of semantic similarity in the final combination algorithm. The combined similarity measure is evaluated on two real-world datasets, Restaurant and Cora, and its effectiveness is measured using several standard evaluation metrics. The experimental results indicate that the combined measure outperforms the string and semantic similarity measures used individually, with an F-measure of 99.1% on the Restaurant dataset and 88.3% on the Cora dataset. Therefore, semantic similarity should be taken into account in addition to string similarity in order to detect duplicate records more effectively in record linkage.
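
The abstract does not spell out the combination rule, so the following Python sketch is only one plausible reading of a per-field, threshold-based combination, not the authors' algorithm. It assumes that the semantic score replaces the string score for a field only when it reaches that field's threshold and is the larger of the two. The synonym table, the function names, and the threshold values are all hypothetical; a real system would likely use a lexical resource such as WordNet in place of the toy lookup.

```python
from difflib import SequenceMatcher

# Hypothetical toy synonym groups standing in for a real semantic resource.
SYNONYM_GROUPS = [
    {"street", "st", "road"},
    {"restaurant", "diner", "eatery"},
]


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def _related(x: str, y: str) -> bool:
    """Two tokens are related if identical or listed in the same synonym group."""
    return x == y or any(x in g and y in g for g in SYNONYM_GROUPS)


def semantic_similarity(a: str, b: str) -> float:
    """Toy semantic score: share of tokens in a with a related token in b."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    matched = sum(1 for x in ta if any(_related(x, y) for y in tb))
    return matched / max(len(ta), len(tb))


def combined_field_similarity(a: str, b: str, threshold: float) -> float:
    """Assumed rule: semantic evidence counts only at or above the threshold."""
    s = string_similarity(a, b)
    m = semantic_similarity(a, b)
    return max(s, m) if m >= threshold else s


def record_similarity(r1: dict, r2: dict, thresholds: dict) -> float:
    """Average the combined per-field scores over the fields with a threshold."""
    fields = list(thresholds)
    total = sum(
        combined_field_similarity(r1[f], r2[f], thresholds[f]) for f in fields
    )
    return total / len(fields)


rec_a = {"name": "joes diner", "addr": "12 main street"}
rec_b = {"name": "joes restaurant", "addr": "12 main st"}
thresholds = {"name": 0.8, "addr": 0.8}  # hypothetical per-field thresholds

print(record_similarity(rec_a, rec_b, thresholds))
```

Under this reading, a field such as the restaurant name, where "diner" and "restaurant" differ as strings but are treated as related, scores higher with the combined measure than with string similarity alone. Raising a field's threshold reduces the influence of semantic similarity on that field.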