Learning-Based Fusion for Data Deduplication

Rule-based deduplication utilizes expert domain knowledge to identify and remove duplicate data records. Achieving high accuracy in a rule-based system requires the creation of rules containing a good combination of discriminatory clues. Unfortunately, accurate rule-based deduplication often requires significant manual tuning of both the rules and the corresponding thresholds. This need for manual tuning reduces the efficacy of rule-based deduplication and its applicability to real-world data sets. No adequate solution exists for this problem. We propose a novel technique for rule-based deduplication. We apply individual deduplication rules, and combine the resultant match scores via learning-based information fusion. We show empirically that our fused deduplication technique achieves higher average accuracy than traditional rule-based deduplication. Further, our technique alleviates the need for manual tuning of the deduplication rules and corresponding thresholds.

[1]  Brant C. White,et al.  United States patent , 1985 .

[2]  Howard B. Newcombe,et al.  Record linkage: making maximum use of the discriminating power of identifying information , 1962, CACM.

[3]  Anuradha Bhamidipaty,et al.  Interactive deduplication using active learning , 2002, KDD.

[4]  Ahmed K. Elmagarmid,et al.  Duplicate Record Detection: A Survey , 2007, IEEE Transactions on Knowledge and Data Engineering.

[5]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[6]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[7]  Marcos André Gonçalves,et al.  Replica identification using genetic programming , 2008, SAC '08.

[8]  Esko Ukkonen,et al.  Approximate String Matching with q-grams and Maximal Matches , 1992, Theor. Comput. Sci..

[9]  Vladimir I. Levenshtein,et al.  Binary codes capable of correcting deletions, insertions, and reversals , 1965 .

[10]  William W. Cohen Integration of heterogeneous databases without common domains using queries based on textual similarity , 1998, SIGMOD '98.

[11]  Charles Elkan,et al.  The Field Matching Problem: Algorithms and Applications , 1996, KDD.

[12]  Arun Ross,et al.  Multibiometric systems , 2004, CACM.

[13]  Stuart E. Madnick,et al.  The inter-database instance identification problem in integrating autonomous systems , 1989, [1989] Proceedings. Fifth International Conference on Data Engineering.

[14]  William E. Winkler,et al.  The State of Record Linkage and Current Research Problems , 1999 .

[15]  Peter Christen,et al.  Towards Automated Record Linkage , 2006, AusDM.