IKMC: An Improved K-Medoids Clustering Method for Near-Duplicated Records Detection

This paper proposes an improved K-medoids clustering algorithm (IKMC) for detecting near-duplicated records. It treats every record in the database as a separate data object, uses the edit-distance method and the weights of attributes to compute similarity values among records, and then detects duplicated records by clustering these similarity values. The algorithm automatically adjusts the number of clusters by comparing the similarity values with a preset similarity threshold, and it avoids the large number of I/O operations that the traditional "sort/merge" algorithm requires for sorting. Experiments show that the algorithm achieves good detection accuracy and high availability.

Keywords: Near-duplicated; record; K-medoids clustering; Edit-distance
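
To make the idea in the abstract concrete, the following Python sketch shows one possible way to combine edit distance, attribute weights, and a similarity threshold so that the number of clusters is not fixed in advance. It is an illustration only and not the IKMC procedure itself: the normalization of edit distance to [0, 1], the single-pass assignment of records to medoids, the medoid update rule, and all function names and parameters (`field_similarity`, `record_similarity`, `cluster_records`, `weights`, `threshold`) are assumptions introduced here.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def record_similarity(r1: Sequence[str], r2: Sequence[str],
                      weights: Sequence[float]) -> float:
    """Weighted average of per-attribute similarities."""
    weighted = sum(w * field_similarity(x, y) for w, x, y in zip(weights, r1, r2))
    return weighted / sum(weights)


def cluster_records(records: Sequence[Sequence[str]],
                    weights: Sequence[float],
                    threshold: float) -> List[List[int]]:
    """Assign each record to its most similar medoid, or open a new cluster.

    A new cluster is created whenever no medoid reaches the threshold, so the
    number of clusters adapts to the data (hypothetical single-pass variant).
    """
    clusters: List[List[int]] = []  # record indices per cluster
    medoids: List[int] = []         # medoids[c] is the medoid index of clusters[c]
    for idx, rec in enumerate(records):
        best, best_sim = -1, -1.0
        for c, m in enumerate(medoids):
            sim = record_similarity(rec, records[m], weights)
            if sim > best_sim:
                best, best_sim = c, sim
        if best >= 0 and best_sim >= threshold:
            members = clusters[best]
            members.append(idx)
            # New medoid: the member most similar, in total, to all members.
            medoids[best] = max(
                members,
                key=lambda i: sum(record_similarity(records[i], records[j], weights)
                                  for j in members))
        else:
            clusters.append([idx])
            medoids.append(idx)
    return clusters


if __name__ == "__main__":
    # Hypothetical records with attributes (name, address, city).
    sample = [
        ("John Smith", "12 Main Street", "Boston"),
        ("Jon Smith", "12 Main St.", "Boston"),
        ("Alice Brown", "99 Oak Avenue", "Denver"),
        ("Alice Browne", "99 Oak Avenue", "Denver"),
    ]
    print(cluster_records(sample, weights=(0.5, 0.3, 0.2), threshold=0.8))
```

With these hypothetical records, weights, and a threshold of 0.8, the first two records fall into one cluster and the last two into another. Raising the threshold to 1.0 leaves every record in its own cluster, which shows how the threshold, rather than a preset k, controls the number of clusters in this sketch.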