Improved suffix blocking for record linkage and entity resolution

Abstract: Record linkage is the problem of identifying the different records that represent the same real-world object. Entity resolution is the problem of ensuring that each real-world object is represented by a single record. The incremental versions of record linkage and entity resolution address the respective problems after a new record is inserted into the dataset. Record linkage, entity resolution and their incremental versions are of paramount importance and arise in several contexts, such as data warehouses, heterogeneous databases and data analysis. Blocking techniques are usually used to address these problems so that not all record pairs have to be compared. Suffix blocking is one of the most efficient and accurate blocking techniques. In this paper, we first consider the non-incremental variation of record linkage and present a method that is more than five times faster than the current state-of-the-art suffix-based blocking method while achieving similar accuracy. Then, we consider the incremental variation of record linkage and propose a novel incremental suffix-based blocking mechanism that outperforms existing incremental blocking methods in terms of both blocking accuracy and efficiency. Finally, we consider incremental entity resolution and present two novel techniques based on suffix blocking that handle the tested dataset in a few seconds, whereas a current state-of-the-art technique requires more than eight hours. Our second technique keeps a history of the deleted records and of the merging process. This allows us to discover alternative matches for the inserted record that existing methods cannot find, which improves the accuracy of the algorithm. We have implemented all our methods and evaluated them extensively in experiments. We offer two implementations of our proposals: the first is memory-based and offers the best efficiency, while the second is disk-based and scales seamlessly to very large datasets.
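
To make the idea of suffix blocking concrete, the following is a minimal Python sketch of the general technique, not the improved method proposed in this paper. It assumes that each record has a single string blocking key, that every suffix of at least `min_len` characters is indexed, and that suffixes shared by more than `max_block_size` records are discarded as too common. The function names and both parameters are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def suffixes(key, min_len):
    """Return all suffixes of key that have at least min_len characters."""
    return [key[i:] for i in range(len(key) - min_len + 1)]


def suffix_blocks(records, min_len=3, max_block_size=10):
    """Group record ids by shared suffixes of their blocking keys.

    records maps a record id to its blocking key (an assumed layout).
    Blocks with a single record produce no pairs, and blocks larger than
    max_block_size are treated as too frequent; both are dropped.
    """
    index = defaultdict(set)
    for rid, key in records.items():
        for suffix in suffixes(key, min_len):
            index[suffix].add(rid)
    return {s: ids for s, ids in index.items()
            if 2 <= len(ids) <= max_block_size}


def candidate_pairs(blocks):
    """Collect the distinct record pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    records = {1: "catherine", 2: "katherine", 3: "smith"}
    blocks = suffix_blocks(records, min_len=4)
    print(candidate_pairs(blocks))
```

In this toy input, the keys "catherine" and "katherine" differ only in their first character, so they share several suffixes and end up as a candidate pair, while "smith" shares no suffix with either and is never compared. Only the candidate pairs would then be passed to a more expensive record comparison step.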
