Efficient Sequential and Parallel Algorithms for Incremental Record Linkage Using Complete Linkage Clustering

In the biomedical domain, record linkage is a crucial problem. When the number of records is very large, existing record linkage algorithms take too much time. Often, a small set of new records has to be linked with a large set of old records. This can be done by pooling the old and new records and performing linkage on all of them, but doing so takes an enormous amount of time. An alternative is to perform linkage incrementally. We refer to any algorithm that does so as an Incremental Record Linkage (IRL) algorithm. In this paper we present an efficient IRL algorithm. Besides being slow, existing algorithms may also suffer from a chaining problem and hence introduce linking errors. As has been observed in the literature, the chaining problem can be solved by clustering under complete linkage. The IRL algorithm we present employs complete linkage and is called the Incremental Record Linkage Algorithm using Complete Linkage (IRLA-CL). We propose sequential and parallel versions of this algorithm. IRLA-CL can handle any number of datasets, whereas many existing algorithms can link only two datasets at a time. Our algorithms outperform previous algorithms and offer state-of-the-art solutions to the IRL problem. They have been tested on synthetic and real datasets containing millions of records. They outperform the best-known RLA-CL algorithm when the number of new records is up to around 20% of the number of old records, and the parallel version achieves nearly linear speedup.
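
To make the idea of incremental linkage under complete linkage concrete, here is a minimal Python sketch. It is not the IRLA-CL algorithm itself, whose details are not given in this abstract. It is a simplified greedy illustration under stated assumptions: records are plain strings, similarity is measured by Levenshtein edit distance, and a new record joins an existing cluster only if its distance to every member of that cluster is within a threshold. The names `edit_distance`, `complete_linkage_distance`, `link_incrementally`, and `threshold` are hypothetical and chosen only for this example.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def complete_linkage_distance(record: str, cluster: List[str]) -> int:
    """Complete linkage: distance to the FARTHEST member of the cluster."""
    return max(edit_distance(record, member) for member in cluster)


def link_incrementally(old_clusters: List[List[str]],
                       new_records: List[str],
                       threshold: int) -> List[List[str]]:
    """Greedily fold new records into existing clusters (illustrative only).

    The old records are never re-clustered against each other; only the
    new records are compared with the existing clusters.
    """
    clusters = [list(c) for c in old_clusters]  # do not mutate the input
    for record in new_records:
        best_index, best_distance = None, None
        for index, cluster in enumerate(clusters):
            d = complete_linkage_distance(record, cluster)
            if d <= threshold and (best_distance is None or d < best_distance):
                best_index, best_distance = index, d
        if best_index is None:
            clusters.append([record])          # no cluster is close enough
        else:
            clusters[best_index].append(record)
    return clusters


if __name__ == "__main__":
    old = [["aaaa"]]
    new = ["aaab", "aabb", "abbb"]
    print(link_incrementally(old, new, threshold=1))
```

In this toy run each record is within edit distance 1 of the next, so a single-linkage rule with the same threshold would chain all four strings into one cluster even though "aaaa" and "abbb" are three edits apart. Requiring every member of a cluster to be within the threshold keeps them in separate clusters, which is the sense in which complete linkage avoids chaining.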
