Paper Information - Learning-Based Fusion for Data Deduplication

Rule-based deduplication utilizes expert domain knowledge to identify and remove duplicate data records. Achieving high accuracy in a rule-based system requires the creation of rules containing a good combination of discriminatory clues. Unfortunately, accurate rule-based deduplication often requires significant manual tuning of both the rules and the corresponding thresholds. This need for manual tuning reduces the efficacy of rule-based deduplication and its applicability to real-world data sets. No adequate solution exists for this problem. We propose a novel technique for rule-based deduplication. We apply individual deduplication rules, and combine the resultant match scores via learning-based information fusion. We show empirically that our fused deduplication technique achieves higher average accuracy than traditional rule-based deduplication. Further, our technique alleviates the need for manual tuning of the deduplication rules and corresponding thresholds.
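The fusion idea in the abstract can be illustrated with a minimal sketch: each rule scores a record pair independently, and a learned model combines the scores so that no per-rule threshold has to be tuned by hand. The abstract does not say which learner the authors use, so logistic regression trained by gradient descent is an assumption here. The rules (`name_rule`, `zip_rule`, `phone_rule`), the record fields, and the sample records are likewise hypothetical and serve only as illustration.

```python
import math
from difflib import SequenceMatcher

# Hypothetical rules (not from the paper): each returns a match score in [0, 1].
def name_rule(a, b):
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def zip_rule(a, b):
    return 1.0 if a["zip"] == b["zip"] else 0.0

def phone_rule(a, b):
    def digits(s):
        return "".join(ch for ch in s if ch.isdigit())
    return 1.0 if digits(a["phone"]) == digits(b["phone"]) else 0.0

RULES = [name_rule, zip_rule, phone_rule]

def rule_scores(a, b):
    """Apply every rule to one record pair and collect the match scores."""
    return [rule(a, b) for rule in RULES]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_fusion(pairs, labels, epochs=500, lr=0.5):
    """Assumed learner: logistic regression over rule scores, fitted by SGD."""
    features = [rule_scores(a, b) for a, b in pairs]
    weights = [0.0] * len(RULES)
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

def fused_score(a, b, weights, bias):
    """Fuse the per-rule scores into one duplicate probability."""
    scores = rule_scores(a, b)
    return sigmoid(bias + sum(w * s for w, s in zip(weights, scores)))

# Invented sample records, for illustration only.
r1 = {"name": "John Smith", "zip": "84602", "phone": "801-555-0100"}
r2 = {"name": "Jon Smith", "zip": "84602", "phone": "(801) 555-0100"}
r3 = {"name": "Mary Jones", "zip": "84602", "phone": "801-555-0199"}
r4 = {"name": "Maria Jones", "zip": "84604", "phone": "801-555-0199"}
r5 = {"name": "Alan Turing", "zip": "10001", "phone": "212-555-0123"}

train_pairs = [(r1, r2), (r3, r4), (r1, r3), (r2, r5), (r4, r5), (r1, r5)]
train_labels = [1, 1, 0, 0, 0, 0]  # 1 = duplicate, 0 = distinct

weights, bias = train_fusion(train_pairs, train_labels)
print("duplicate pair:", round(fused_score(r1, r2, weights, bias), 3))
print("distinct pair: ", round(fused_score(r1, r3, weights, bias), 3))
```

In this sketch the only decision threshold is the 0.5 cut on the fused probability; the relative importance of each rule is learned from labeled pairs rather than set manually.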

Parris K. Egbert | Stephen W. Clyde | Jared Dinerstein | Sabra Dinerstein
