Robust record linkage blocking using suffix arrays

Record linkage is an important data integration task with many practical uses for matching, merging and duplicate removal in large and diverse databases. However, the quadratic complexity of the brute-force approach, which compares every pair of records, necessitates appropriate indexing or blocking techniques. We design and evaluate an efficient and highly scalable blocking approach based on suffix arrays. Our suffix grouping technique exploits the ordering used by the index to merge similar blocks at marginal extra cost, resulting in much higher accuracy while retaining the high scalability of the base suffix array method. Similar suffixes are grouped efficiently using a sliding window technique. We carry out an in-depth analysis of our method and show results from experiments using real and synthetic data, which highlight the importance of efficient indexing and blocking in real-world applications where data sets contain millions of records.
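The idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the parameters `min_len`, `max_block` and `sim_threshold`, the window of two adjacent suffixes, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration. The sketch indexes every sufficiently long suffix of each record's blocking key, sorts the suffixes, and merges the blocks of neighbouring suffixes that are similar. Keys shorter than `min_len` are simply skipped here.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def suffixes(key, min_len):
    """Return every suffix of `key` that has at least `min_len` characters."""
    return [key[i:] for i in range(len(key) - min_len + 1)]


def suffix_blocks(records, min_len=4, max_block=10, sim_threshold=0.8):
    """Group record ids into blocks (hypothetical helper, not from the paper).

    `records` maps a record id to its blocking key value (a string).
    """
    # Inverted index: suffix -> ids of the records whose key has that suffix.
    index = defaultdict(set)
    for rec_id, key in records.items():
        for suf in suffixes(key, min_len):
            index[suf].add(rec_id)

    # Sort the suffixes and drop those shared by too many records.
    ordered = sorted(s for s, ids in index.items() if len(ids) <= max_block)

    # Slide over adjacent suffixes in sorted order. A suffix similar to its
    # predecessor is merged into the predecessor's block (assumed window of 2).
    blocks = []
    prev = None
    for suf in ordered:
        similar = (
            prev is not None
            and SequenceMatcher(None, prev, suf).ratio() >= sim_threshold
        )
        if similar:
            blocks[-1] |= index[suf]
        else:
            blocks.append(set(index[suf]))
        prev = suf
    return blocks


def candidate_pairs(blocks):
    """Return the set of record pairs that share at least one block."""
    pairs = set()
    for block in blocks:
        pairs.update(combinations(sorted(block), 2))
    return pairs


if __name__ == "__main__":
    records = {1: "catherine", 2: "catherina", 3: "smith"}
    print(candidate_pairs(suffix_blocks(records)))
```

In this toy input, "catherine" and "catherina" share no exact suffix because they differ in the last character, so exact suffix matching alone would not place them in the same block. Their suffixes sit next to each other in sorted order, which is what lets the grouping step merge them cheaply.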
