Leveraging Unlabeled Data to Scale Blocking for Record Linkage

Record linkage is the process of matching records between two (or multiple) data sets that represent the same real-world entity. An exhaustive record linkage process involves computing the similarities between all pairs of records, which can be very expensive for large data sets. Blocking techniques alleviate this problem by dividing the records into blocks and only comparing records within the same block. To be adaptive from domain to domain, one category of blocking technique formalizes 'construction of blocking scheme' as a machine learning problem. In the process of learning the best blocking scheme, previous learning-based techniques utilize only a set of labeled data. However, since the set of labeled data is usually not large enough to well characterize the unseen (unlabeled) data, the resultant blocking scheme may poorly perform on the unseen data by generating too many candidate matches. To address that, in this paper, we propose to utilize unlabeled data (in addition to labeled data) for learning blocking schemes. Our experimental results show that using unlabeled data in learning can remarkably reduce the number of candidate matches while keeping the same level of coverage for true matches.

[1]  Howard B. Newcombe,et al.  Record linkage: making maximum use of the discriminating power of identifying information , 1962, CACM.

[2]  C. Lee Giles,et al.  Adaptive sorted neighborhood methods for efficient record linkage , 2007, JCDL '07.

[3]  Panagiotis Papapetrou,et al.  Nearest Neighbor Retrieval Using Distance-Based Hashing , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[4]  Georgia Koutrika,et al.  Entity resolution with iterative blocking , 2009, SIGMOD Conference.

[5]  Raymond J. Mooney,et al.  Adaptive Blocking: Learning to Scale Up Record Linkage , 2006, Sixth International Conference on Data Mining (ICDM'06).

[6]  Alexandr Andoni,et al.  Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[7]  Craig A. Knoblock,et al.  Learning Blocking Schemes for Record Linkage , 2006, AAAI.

[8]  Ahmed K. Elmagarmid,et al.  TAILOR: a record linkage toolbox , 2002, Proceedings 18th International Conference on Data Engineering.

[9]  Peter Christen,et al.  A Comparison of Fast Blocking Methods for Record Linkage , 2003, KDD 2003.

[10]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[11]  Dongwon Lee,et al.  HARRA: fast iterative hashed record linkage for large-scale data collections , 2010, EDBT '10.

[12]  David Peleg Approximation algorithms for the Label-CoverMAX and Red-Blue Set Cover problems , 2007, J. Discrete Algorithms.

[13]  Richard M. Karp,et al.  Reducibility Among Combinatorial Problems , 1972, 50 Years of Integer Programming.

[14]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[15]  Lifang Gu,et al.  Adaptive Filtering for Efficient Record Linkage , 2004, SDM.

[16]  S. Crawford,et al.  Volume 1 , 2012, Journal of Diabetes Investigation.

[17]  Wagner Meira,et al.  Adaptive and Flexible Blocking for Record Linkage Tasks , 2010, J. Inf. Data Manag..