A mixture model for the analysis of data derived from record linkage

Combining information from two data sources depends on finding records that belong to the same individual (matches). When no unique identifier per individual is available, we have to rely on partially identifying variables that are registered in both data sources. A risk of relying on these variables is that some records from the two datasets are wrongly linked to each other, which introduces bias in subsequent regression analyses. In this paper, we propose a mixture model in which the indicator of whether two records belong to the same individual is treated as missing. Each pair of records from the two datasets contributes independently to a pairwise pseudo-likelihood, which we maximize with an expectation-maximization algorithm. Each part of the pseudo-likelihood is parameterized by the appropriate (parametric) density function. Moreover, some data structures allow for simplifying assumptions, which make the pseudo-likelihood considerably easier to parameterize. Because the optimization requires a product over all combinations of records from the two datasets, we suggest a procedure that summarizes the information from highly unlikely matches. In simulations, the new approach produced accurate estimates in different linkage scenarios. Moreover, the estimator remained accurate in scenarios where previously proposed analysis approaches gave biased results. We applied the method to estimate the association between the pregnancy durations of the first- and second-born children of the same mother, using a register that has no mother identifier.
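To make the idea of a pairwise mixture with a missing match indicator concrete, the following is a minimal sketch and not the estimator of the paper. It rests on several assumptions that do not come from the abstract: a single linking variable compared only for agreement, a normal linear regression of an outcome `y` (file B) on a covariate `x` (file A) for matched pairs, and independence of `x` and `y` for non-matched pairs, so that the non-match density of `y` is simply its marginal normal density. All names (`fit_pairwise_mixture`, `p_match`, `m`, `u`) are hypothetical, and every record pair is used directly, without any summarizing of unlikely matches.

```python
import numpy as np

def normal_pdf(z, mean, sd):
    """Density of N(mean, sd^2) evaluated at z (vectorized)."""
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def fit_pairwise_mixture(x, y, agree, n_iter=200):
    """EM for a two-component mixture over all record pairs (i, j).

    x     : covariate from file A, shape (n_a,)
    y     : outcome from file B, shape (n_b,)
    agree : boolean array (n_a, n_b), True where the linking variable agrees
    Assumed model (illustrative only): matched pairs follow
    y = a + b * x + N(0, sigma^2); non-matched pairs have y drawn from its
    marginal distribution, independently of x.
    """
    n_a, n_b = agree.shape
    X = np.broadcast_to(x[:, None], (n_a, n_b))
    Y = np.broadcast_to(y[None, :], (n_a, n_b))
    mu_y, sd_y = y.mean(), y.std()          # non-match density, kept fixed
    p = 1.0 / max(n_a, n_b)                 # starting share of matched pairs
    m, u = 0.9, agree.mean()                # agreement prob. for match / non-match
    a, b, sigma = mu_y, 0.0, sd_y           # starting regression parameters
    for _ in range(n_iter):
        # E-step: posterior probability that pair (i, j) is a match
        f_match = p * np.where(agree, m, 1 - m) * normal_pdf(Y, a + b * X, sigma)
        f_non = (1 - p) * np.where(agree, u, 1 - u) * normal_pdf(Y, mu_y, sd_y)
        w = f_match / (f_match + f_non)
        # M-step: weighted updates of the mixing, linkage and regression parameters
        sw = w.sum()
        p = sw / w.size
        m = (w * agree).sum() / sw
        u = ((1 - w) * agree).sum() / (w.size - sw)
        x_bar = (w * X).sum() / sw
        y_bar = (w * Y).sum() / sw
        b = (w * (X - x_bar) * (Y - y_bar)).sum() / (w * (X - x_bar) ** 2).sum()
        a = y_bar - b * x_bar
        sigma = np.sqrt((w * (Y - a - b * X) ** 2).sum() / sw)
    return {"intercept": a, "slope": b, "sigma": sigma,
            "p_match": p, "m": m, "u": u}

# Simulated example: 300 individuals, a 200-level linking key with 5% errors.
rng = np.random.default_rng(0)
n = 300
x = rng.normal(size=n)
y_true = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)
key_a = rng.integers(0, 200, size=n)
key_b = key_a.copy()
err = rng.random(n) < 0.05
key_b[err] = rng.integers(0, 200, size=err.sum())
perm = rng.permutation(n)                   # file B arrives in unknown order
y, key_b = y_true[perm], key_b[perm]
agree = key_a[:, None] == key_b[None, :]

fit = fit_pairwise_mixture(x, y, agree)
print(fit)
```

In this toy setting the regression parameters are estimated from all pairs at once, each weighted by its posterior match probability, so no hard linkage decision is made before the analysis. The cost is an array of size `n_a * n_b`, which is the computational burden that motivates summarizing highly unlikely matches in larger problems.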
