Improving Temporal Record Linkage Using Regression Classification

Temporal record linkage is the process of identifying groups of records collected over a period of time, such as in census or voter registration databases, where the records in each group represent the same real-world entity. Such databases often contain temporal information, such as the time when a record was created or modified. Unlike traditional record linkage, which treats differences between records of the same entity as errors or variations, temporal record linkage aims to link records of entities whose attribute values are known to change over time. In this paper we propose a novel approach that extends an existing temporal approach, the decay model, to calculate probabilities of change categorically for each attribute. Our method uses a regression-based machine learning model to predict decays for sets of attributes. Each such set consists of a principal attribute and support attributes, where the values of the support attributes can affect the decay of the principal attribute. Our experimental results on a real US voter database show that the proposed approach achieves better linkage quality than the decay model approach.
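To make the idea of predicting a decay from a principal attribute and its support attributes concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not specify the regressor, the features, or the training procedure, so everything below is an assumption made for illustration. The sketch treats a decay as the probability that the principal attribute's value has changed between two records of the same entity, uses ordinary least squares as a stand-in regressor, and invents a toy training set in which the principal attribute could be an address and the single support attribute a surname.

```python
def fit_least_squares(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(X[0])
    # Build X^T X and X^T y.
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * target for row, target in zip(X, y)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    return [b[i] / A[i][i] for i in range(k)]


def predict_decay(weights, years_apart, support_changed):
    """Predicted probability that the principal attribute changed, clipped to [0, 1]."""
    raw = weights[0] + weights[1] * years_apart + weights[2] * support_changed
    return min(1.0, max(0.0, raw))


# Hypothetical training pairs of records known to belong to the same entity:
# (years between the records, support attribute changed?, principal attribute changed?)
pairs = [
    (1, 0, 0), (2, 0, 0), (4, 0, 0), (6, 0, 1),
    (1, 1, 0), (2, 1, 1), (4, 1, 1), (6, 1, 1),
]

# Design matrix with an intercept column, and the 0/1 change indicator as target.
X = [[1.0, float(gap), float(support)] for gap, support, _ in pairs]
y = [float(changed) for _, _, changed in pairs]

weights = fit_least_squares(X, y)

# Decay for two records three years apart, with and without a change
# in the support attribute.
decay_without_support_change = predict_decay(weights, 3, 0)
decay_with_support_change = predict_decay(weights, 3, 1)
```

In this toy data the predicted decay grows with the time gap and is higher when the support attribute has also changed, which is the kind of dependency between support and principal attributes that the abstract describes. A linear model clipped to [0, 1] is only one possible choice of regressor.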
