Efficient Sequential and Parallel Algorithms for Incremental Record Linkage

Given a collection of records, the problem of record linkage is to cluster them such that each cluster contains all the records of one and only one individual. Existing algorithms for this important problem have long run times, especially when the number of records is large. Often, a small number of new records have to be linked with a large number of existing records, and linking the old and new records together from scratch can be expensive. We refer to any algorithm that efficiently links the new records with the existing ones as an incremental record linkage (IRL) algorithm, and in this paper we offer novel IRL algorithms. Clustering is the basic approach we employ. Our algorithms use a novel random sampling technique to compute the distance between a new record and any cluster, and they assign the new record to the cluster to which its distance is smallest. The idea is to compute the distance between the new record and only a random subset of the cluster's records. We can use a sampling lemma to show that this estimate is very accurate. We have developed both sequential and parallel implementations of our algorithms. They outperform the best-known prior algorithm (called RLA). For example, one of our algorithms takes 71.22 s to link 100,000 new records with a database of 1,000,000 records. In comparison, the current best algorithm takes 140.91 s to link all 1,100,000 records. We achieve a very nearly linear speedup in parallel; for instance, we obtain a speedup of 28.28 with 32 cores. To the best of our knowledge, we are the first to propose parallel IRL algorithms. Our algorithms offer state-of-the-art solutions to the IRL problem.
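The sampling idea described above can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes records are plain strings, uses Levenshtein edit distance as the record distance, and takes the mean distance to the sampled members as the record-to-cluster distance. The function names (`edit_distance`, `sampled_distance`, `link_record`) and the `sample_size` parameter are hypothetical and chosen only for illustration.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sampled_distance(record, cluster, sample_size, rng):
    """Estimate the record-to-cluster distance from a random subset.

    Assumption: the cluster distance is the mean edit distance to the
    sampled members; the abstract does not specify the exact measure.
    """
    k = min(sample_size, len(cluster))
    sample = rng.sample(cluster, k)
    return sum(edit_distance(record, member) for member in sample) / k


def link_record(record, clusters, sample_size=3, rng=None):
    """Append the record to the cluster with the smallest sampled distance.

    `clusters` is a list of non-empty lists of strings. Returns the index
    of the chosen cluster.
    """
    rng = rng or random.Random(0)
    best = min(
        range(len(clusters)),
        key=lambda i: sampled_distance(record, clusters[i], sample_size, rng),
    )
    clusters[best].append(record)
    return best


if __name__ == "__main__":
    existing = [
        ["john smith", "jon smith", "john smyth"],
        ["mary jones", "mary jone", "marie jones"],
    ]
    chosen = link_record("john smithe", existing, sample_size=2)
    print("linked to cluster", chosen)
```

With a fixed `sample_size`, the cost of linking one new record depends on the number of clusters rather than on the total number of existing records, which is the intuition behind the savings. A practical system would presumably also need a rule for starting a new cluster when no existing cluster is close enough; that step is omitted here.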
