Adaptive Windows for Duplicate Detection

Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity. This task is difficult because (i) representations of the same entity may differ slightly, so a similarity measure must be defined to compare pairs of records, and (ii) data sets can be so large that a pairwise comparison of all records is infeasible. To address the second problem, many algorithms partition the data set and compare record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then slides a window over the data, comparing only records that appear within the same window. We propose the Duplicate Count Strategy (DCS), a variation of SNM that uses a varying window size. It is based on the intuition that there may be regions of high similarity, suggesting a larger window, and regions of lower similarity, suggesting a smaller one. Beyond the basic variant of DCS, we also propose and thoroughly evaluate a variant called DCS++, which is provably more efficient than the original SNM: it produces the same results with fewer comparisons.
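The core idea of growing the window based on detected duplicates can be illustrated with a minimal sketch of the basic DCS variant. This is not the paper's implementation: the function names, the default starting window size, and the threshold value are illustrative assumptions; the strategy shown is only the general mechanism described above (widen the window while the ratio of detected duplicates to comparisons stays at or above a threshold).

```python
def duplicate_count_strategy(records, key, is_duplicate, w_init=3, phi=0.5):
    """Sketch of a basic Duplicate Count Strategy (DCS).

    Sorts records by `key` and, for each record, compares it to its
    successors within a window. The window grows while the ratio of
    detected duplicates to comparisons is at least `phi`; otherwise
    it stays at the initial size. Defaults are illustrative.
    """
    data = sorted(records, key=key)
    pairs = []
    for i in range(len(data) - 1):
        w = w_init
        j = i + 1
        duplicates = comparisons = 0
        while j < min(i + w, len(data)):
            comparisons += 1
            if is_duplicate(data[i], data[j]):
                duplicates += 1
                pairs.append((data[i], data[j]))
            j += 1
            # At the window boundary, grow the window if the
            # duplicate ratio so far is high enough.
            if j == i + w and duplicates / comparisons >= phi:
                w += 1
    return pairs


recs = ["alice", "Alice", "bob", "alicia"]
found = duplicate_count_strategy(
    recs,
    key=str.lower,
    is_duplicate=lambda a, b: a.lower() == b.lower(),
)
```

With a fixed window this would degrade to plain SNM; the adaptive step above is what lets dense duplicate regions be covered by a wider window without paying that cost everywhere.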
