PC-Filter: A Robust Filtering Technique for Duplicate Record Detection in Large Databases

In this paper, we propose PC-Filter (PC stands for Partition Comparison), a robust data filter for detecting approximately duplicate records in large databases. PC-Filter distinguishes itself from all existing methods by using the notion of partitions in duplicate detection. It first sorts the whole database and splits the sorted database into a number of record partitions. The Partition Comparison Graph (PCG) is then constructed by performing fast partition pruning. Finally, duplicate records are effectively detected by internal and external partition comparison based on the PCG. Four properties, used as heuristics, have been devised to make the filter remarkably efficient, based on the triangle inequality of record similarity. PC-Filter is insensitive to the key used to sort the database, and achieves a recall level comparable to that of the pair-wise record comparison method, but with a complexity of only O(N^(4/3)). By equipping existing detection methods with PC-Filter, we are able to solve the "Key Selection", "Scope Specification" and "Low Recall" problems that these methods suffer from.
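To make the overall pipeline concrete, the following is a minimal Python sketch of the sort-partition-compare idea described above. It is not the paper's actual algorithm: the similarity measure, the partition size, the threshold, and especially the pruning step (here naively restricted to adjacent partitions, standing in for the PCG built with the triangle-inequality heuristics) are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def record_similarity(r1, r2):
    """Illustrative similarity measure; the paper's measure may differ."""
    return SequenceMatcher(None, " ".join(r1), " ".join(r2)).ratio()


def partition_filter_sketch(records, key, partition_size=100, threshold=0.9):
    # 1. Sort the whole database on a chosen key.
    sorted_records = sorted(records, key=key)

    # 2. Split the sorted database into fixed-size record partitions.
    partitions = [sorted_records[i:i + partition_size]
                  for i in range(0, len(sorted_records), partition_size)]

    duplicates = []

    # 3. Internal comparison: compare record pairs within each partition.
    for part in partitions:
        for i in range(len(part)):
            for j in range(i + 1, len(part)):
                if record_similarity(part[i], part[j]) >= threshold:
                    duplicates.append((part[i], part[j]))

    # 4. External comparison: compare records across partition pairs that
    #    survive pruning. Here only adjacent partitions are compared as a
    #    placeholder for the PCG-based pruning in the paper.
    for p, q in zip(partitions, partitions[1:]):
        for r1 in p:
            for r2 in q:
                if record_similarity(r1, r2) >= threshold:
                    duplicates.append((r1, r2))

    return duplicates
```

The sketch shows why partitioning limits the number of comparisons: most record pairs never meet because they fall into partitions that the pruning step never pairs up, which is the source of the sub-quadratic cost claimed in the abstract.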