Effective Incremental Clustering for Duplicate Detection in Large Databases

We propose an incremental algorithm for discovering clusters of duplicate tuples in large databases. The core of the approach is an indexing technique that, for any newly arrived tuple μ, efficiently retrieves the set of tuples in the database that are most similar to μ and are therefore likely to refer to the same real-world entity as μ. The proposed index is based on a hashing approach that tends to assign similar objects to the same buckets. Empirical and analytical evaluation shows that the proposed approach achieves satisfactory efficiency at the cost of a small loss in accuracy.
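
To make the idea of a similarity-preserving hash index concrete, here is a minimal sketch in Python. The abstract does not specify the hashing scheme, so the sketch assumes MinHash signatures over character q-grams, split into bands so that similar strings tend to collide in at least one bucket. The class name IncrementalDeduper, the helper functions, and the parameters bands, rows, and threshold are all hypothetical and chosen only for illustration; this is not the authors' algorithm.

```python
import hashlib
from collections import defaultdict


def qgrams(text, q=3):
    """Return the set of padded character q-grams of a lowercased string."""
    pad = "#" * (q - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def _hash(seed, token):
    """Stable 64-bit hash of a token under a given seed (assumed hash family)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes):
    """MinHash signature: the minimum hash value of the set under each seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


class IncrementalDeduper:
    """Assigns each arriving tuple to an existing cluster or opens a new one."""

    def __init__(self, bands=16, rows=2, threshold=0.5):
        self.bands, self.rows, self.threshold = bands, rows, threshold
        self.buckets = defaultdict(list)  # (band index, band values) -> tuple ids
        self.grams = []                   # tuple id -> q-gram set
        self.cluster = []                 # tuple id -> cluster id

    def _bucket_keys(self, grams):
        sig = minhash_signature(grams, self.bands * self.rows)
        return [(b, sig[b * self.rows:(b + 1) * self.rows]) for b in range(self.bands)]

    def add(self, text):
        """Index a new tuple and return the id of the cluster it joins."""
        grams = qgrams(text)
        keys = self._bucket_keys(grams)
        # Candidates are only the tuples sharing at least one bucket.
        candidates = {i for k in keys for i in self.buckets[k]}
        best = max(candidates, key=lambda i: jaccard(grams, self.grams[i]), default=None)
        tid = len(self.grams)
        if best is not None and jaccard(grams, self.grams[best]) >= self.threshold:
            cid = self.cluster[best]  # join the cluster of the nearest candidate
        else:
            cid = tid                 # no sufficiently similar tuple: new cluster
        self.grams.append(grams)
        self.cluster.append(cid)
        for k in keys:
            self.buckets[k].append(tid)
        return cid
```

Each call to add compares the new tuple only against the tuples that share a bucket with it, not against the whole database, which is where the efficiency gain would come from. The accuracy loss mentioned in the abstract corresponds here to true duplicates that happen not to collide in any band and are therefore never compared.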
