A Bayesian Approach to Linking Data Without Unique Identifiers

Existing file linkage methods may produce suboptimal results because they consider neither the interactions between different pairs of matched records nor the relationships between variables that are exclusive to one of the files. In addition, many current methods fail to address the uncertainty in the linkage, which may result in overly precise estimates of relationships between variables that are exclusive to one of the files. Bayesian methods for record linkage can reduce the bias in the estimation of scientific relationships of interest and provide interval estimates that account for the uncertainty in the linkage; however, these methods are often complex to implement and computationally intensive. This article presents the GFS package for the R programming language, which uses a Bayesian approach to file linkage. The linking procedure implemented in GFS samples from the joint posterior distribution of the model parameters and the linking permutations. The algorithm treats file linkage as a missing data problem and generates multiple linked data sets. For computational efficiency, only the linkage permutations are stored, and the analysis of interest is performed separately under each permutation. This implementation reduces both the computational complexity of the linking process and the expertise required of researchers analyzing linked data sets. We describe the algorithm implemented in the GFS package and its statistical basis, and demonstrate its use on a sample data set.
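The idea of storing only the linkage permutations and analyzing each one separately can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not the GFS package or its API. The function names, the toy data, and the choice of a simple regression slope as the analysis are all assumptions made for illustration, and the permutations are supplied by hand rather than drawn from a posterior. The per-permutation results are pooled with the standard multiple-imputation combining rules, on the assumption that the linked data sets are treated as multiple imputations.

```python
from statistics import mean, variance


def slope_and_variance(x, y):
    """OLS slope of y on x and its estimated sampling variance."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, rss / (n - 2) / sxx


def combine_rubin(estimates, variances):
    """Pool m >= 2 estimates with the multiple-imputation combining rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)           # average within-permutation variance
    between = variance(estimates)      # between-permutation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return pooled, total


def analyze_permutations(x_a, y_b, permutations):
    """Run the analysis once per stored permutation and pool the results.

    permutations[k][i] is the index of the record in file B that is linked
    to record i of file A in the k-th stored linkage (hypothetical layout).
    """
    results = [
        slope_and_variance(x_a, [y_b[j] for j in perm]) for perm in permutations
    ]
    estimates, variances = zip(*results)
    return combine_rubin(list(estimates), list(variances))


if __name__ == "__main__":
    # Toy files: x is observed only in file A, y only in file B.
    x_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    y_b = [2.1, 3.9, 6.2, 7.8, 10.1]
    # Hand-written stand-ins for stored linkage permutations.
    stored = [[0, 1, 2, 3, 4], [0, 2, 1, 3, 4], [0, 1, 2, 4, 3]]
    pooled, total = analyze_permutations(x_a, y_b, stored)
    print(f"pooled slope = {pooled:.3f}, total variance = {total:.4f}")
```

Because the between-permutation variance enters the total variance, disagreement among the stored linkages widens the resulting interval estimate, which is how the uncertainty in the linkage carries through to the analysis.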
