The merge/purge problem for large databases

Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals that appear in a number of different databases, typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient a manner as possible, while maximizing the accuracy of the result. We call this the merge/purge problem. In this paper we detail the sorted neighborhood method that is used by some to solve merge/purge and present experimental results that demonstrate this approach may work well in practice, but at great expense. An alternative method based upon clustering is also presented, with a comparative evaluation against the sorted neighborhood method. We show a means of improving the accuracy of the results based upon a multi-pass approach that succeeds by computing the transitive closure over the results of independent runs, with each pass considering alternative primary key attributes.
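
As a rough illustration of the multi-pass idea, the following Python sketch runs a sorted neighborhood pass once per sort key, pools the matched pairs, and takes their transitive closure with a union-find structure. It is a minimal sketch under stated assumptions, not the implementation evaluated in the paper: the record fields, the toy `is_match` rule, the window size, and all function names are hypothetical and chosen only for this example.

```python
def sorted_neighborhood(records, key, is_match, window=3):
    """One pass: sort by `key`, then compare each record only with the
    (window - 1) records that precede it in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def transitive_closure(n, pairs):
    """Group record indices into equivalence classes using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def multi_pass(records, keys, is_match, window=3):
    """Run one independent pass per key, then close over all matched pairs."""
    pairs = set()
    for key in keys:
        pairs |= sorted_neighborhood(records, key, is_match, window)
    return transitive_closure(len(records), pairs)


# Hypothetical toy data: records 0, 1 and 3 describe the same person.
records = [
    {"first": "John", "last": "Smith", "ssn": "123456789"},
    {"first": "Jon", "last": "Smyth", "ssn": "123456789"},   # misspelled name
    {"first": "Mary", "last": "Smithers", "ssn": "555000111"},
    {"first": "John", "last": "Smith", "ssn": "923456789"},  # mistyped ssn
    {"first": "Ann", "last": "Lee", "ssn": "500000000"},
]


def name_key(r):
    return r["last"] + r["first"]


def ssn_key(r):
    return r["ssn"]


def is_match(a, b):
    # Toy equivalence rule (an assumption): identical ssn or identical full name.
    return a["ssn"] == b["ssn"] or (a["last"], a["first"]) == (b["last"], b["first"])


clusters = multi_pass(records, [name_key, ssn_key], is_match, window=2)
print(clusters)
```

With a window of 2, the name-keyed pass pairs only records 0 and 3, and the ssn-keyed pass pairs only records 0 and 1; neither pass alone links all three. Taking the closure over both sets of pairs merges them into a single cluster, which is why several cheap passes with small windows can recover matches that any single pass misses.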
