Innovative Windows for Duplicate Detection

Duplicate detection is the special case of data matching that discovers groups of records within a single database that refer to the same real-world entity. It is an essential part of data cleansing, because duplicate records can strongly influence the results of later data mining or processing. In the naive approach, every record is compared with every other record. Differing data representations, formats, and terminologies, together with data entry errors, make this task complex, and large databases make it more complex still. To reduce the number of record comparisons, indexing algorithms partition the data and perform comparisons only within each partition. The Sorted Neighborhood Method (SNM) is a standard indexing algorithm that sorts the dataset by a defined “sorting key” and slides a fixed-size window over the sorted records, comparing only the records within that window. Duplicate Count Strategy-Multi record increase (DCS++) is the latest improvement to SNM; it adapts the window size for every duplicate detected in the current window. We propose the Innovative Windows (Inn Win) algorithm, which assumes that i) a detected duplicate in the sorted dataset raises the probability of finding more duplicates in its neighborhood, and ii) a series of consecutive non-duplicates lowers the probability of duplicates in the neighborhood. Using this concept, it adapts the window for both duplicates and non-duplicates and avoids unnecessary comparisons without losing effectiveness. We show that Inn Win is a better alternative among windowing algorithms.
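As a point of reference for the windowing idea, the following is a minimal sketch of the basic fixed-window SNM only; it is not the Inn Win or DCS++ algorithm. The function names, the string-similarity matcher based on Python's `difflib`, and the threshold of 0.85 are assumptions made for illustration and do not come from the paper.

```python
from difflib import SequenceMatcher


def is_duplicate(a, b, threshold=0.85):
    # Hypothetical matcher: two strings count as duplicates when their
    # similarity ratio reaches the (assumed) threshold.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def sorted_neighborhood(records, key, window=3, match=is_duplicate):
    """Fixed-window SNM sketch.

    Sorts record indices by the sorting key, then compares each record
    with the next (window - 1) records in sorted order.
    Returns (set of duplicate index pairs, number of comparisons made).
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    comparisons = 0
    for pos, i in enumerate(order):
        # Neighbors that share a window with record i.
        for j in order[pos + 1 : pos + window]:
            comparisons += 1
            if match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs, comparisons


if __name__ == "__main__":
    people = ["john smith", "jon smith", "mary jones", "mary jonas", "zed alpha"]
    found, cost = sorted_neighborhood(people, key=lambda r: r, window=2)
    print(sorted(found), cost)
```

The returned comparison count is the cost that adaptive approaches try to lower: a naive all-pairs pass over n records needs n(n-1)/2 comparisons, while a fixed window of size w needs roughly n(w-1).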
