Methods for analyzing data from probabilistic linkage strategies based on partially identifying variables

In record linkage studies, unique identifiers are often unavailable, so the linkage procedure depends on combinations of partially identifying variables with low discriminating power. As a consequence, some covariate and outcome pairs are wrongly linked, and these false links bias subsequent analysis of the linked data. In this article, we investigated two estimators that correct for linkage error in regression analysis. We extended the estimators developed by Lahiri and Larsen and also proposed a weighted least squares approach to deal with linkage error. We considered both linear and logistic regression problems and evaluated the performance of both methods in simulations. Our results show that, in both approaches, all wrongly linked covariate and outcome pairs need to be removed from the analysis in order to obtain unbiased regression coefficients. This removal requires strong assumptions about the structure of the data. In addition, the bias increases substantially when the assumptions do not hold and wrongly linked records influence the coefficient estimation. Our simulations showed that both methods performed similarly in linear regression problems. In logistic regression problems, the weighted least squares method showed less bias. Because the specific structure of the data in record linkage problems often leads to different assumptions, the analyst needs prior knowledge of the nature of the data. These assumptions are more easily introduced in the weighted least squares approach than in the Lahiri and Larsen estimator.
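The effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code or their extended estimators. It assumes a simple exchangeable linkage error model (each record is correctly linked with probability `lam`, otherwise linked to a uniformly chosen other record), a linear model with hypothetical coefficients, and a known `lam`. Under those assumptions it compares a naive least squares fit, the basic Lahiri and Larsen estimator (regress the linked outcome `z` on `QX`, where `Q` holds the linkage probabilities), and a weighted least squares fit with oracle 0/1 weights that remove every wrong link. The oracle weights are an illustrative stand-in, not the weighting scheme proposed in the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: all values below are assumptions for illustration.
n = 2000
beta_true = np.array([1.0, 2.0])   # intercept, slope
lam = 0.7                          # assumed probability of a correct link

# True covariates and outcomes before linkage.
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Exchangeable linkage error: record i keeps its own outcome with
# probability lam, otherwise it receives the outcome of a uniformly
# chosen other record j != i.
correct = rng.random(n) < lam
offset = rng.integers(1, n, size=n)            # values in 1..n-1
wrong_partner = (np.arange(n) + offset) % n    # never equal to i
linked = np.where(correct, np.arange(n), wrong_partner)
z = y[linked]                                  # observed (linked) outcome


def least_squares(A, b):
    """Ordinary least squares coefficients for b ~ A."""
    return np.linalg.lstsq(A, b, rcond=None)[0]


def lahiri_larsen(X, z, lam):
    """Regress z on QX, with q_ii = lam and q_ij = (1 - lam) / (n - 1)."""
    n = X.shape[0]
    off = (1.0 - lam) / (n - 1)
    # Row i of QX is lam * X_i + off * (sum_j X_j - X_i); Q is never formed.
    W = (lam - off) * X + off * X.sum(axis=0)
    return least_squares(W, z)


def weighted_least_squares(X, z, w):
    """Weighted least squares with nonnegative observation weights w."""
    sw = np.sqrt(w)
    return least_squares(X * sw[:, None], z * sw)


beta_naive = least_squares(X, z)                 # ignores linkage error
beta_ll = lahiri_larsen(X, z, lam)               # corrects via QX
beta_oracle = weighted_least_squares(X, z, correct.astype(float))

print("naive :", beta_naive)
print("LL    :", beta_ll)
print("oracle:", beta_oracle)
```

In this setup the naive slope is attenuated towards `lam` times the true slope, because a wrongly linked outcome is unrelated to the record's own covariate, while the other two fits recover a slope close to the true value. In practice neither `lam` nor the set of correct links is known, which is where the assumptions discussed in the abstract enter.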

[1]  D. Rubin,et al.  A method for calibrating false-match rates in record linkage , 1995 .

[2]  W. Winkler Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage , 1993 .

[3]  Gunky Kim,et al.  Regression analysis under incomplete linkage , 2012, Comput. Stat. Data Anal..

[4]  B. Yawn,et al.  Use of a Medical Records Linkage System to Enumerate a Dynamic Population over Time: the Rochester Epidemiology Project , 2022, American Journal of Epidemiology.

[5]  D. Rubin,et al.  Iterative Automated Record Linkage Using Mixture Models , 2001 .

[6]  James O. Chipperfield,et al.  Inference Based on Estimating Equations and Probability-Linked Data , 2009 .

[7]  John Neter,et al.  The Effect of Mismatching on the Measurement of Response Errors , 1965 .

[8]  P. Lahiri,et al.  Regression Analysis With Linked Data , 2005 .

[9]  H. Quan,et al.  Assessing record linkage between health care and Vital Statistics databases using deterministic methods , 2006, BMC Health Services Research.

[10]  William E. Winkler,et al.  Approximate String Comparison and its Effect on an Advanced Record Linkage System , 1997 .

[11]  Howard B. Newcombe,et al.  Handbook of record linkage: methods for health and statistical studies, administration, and business , 1988 .

[12]  Fritz Scheuren,et al.  Regression Analysis of Data Files that Are Computer Matched , 1993 .

[13]  Ivan P. Fellegi,et al.  A Theory for Record Linkage , 1969 .

[14]  William E. Yancey Improving EM Algorithm Estimates for Record Linkage Parameters , 2002 .

[15]  Scott L. DuVall,et al.  Extending the Fellegi-Sunter probabilistic record linkage method for approximate field comparators , 2010, J. Biomed. Informatics.

[16]  G R Howe,et al.  Use of computerized record linkage in cohort studies. , 1998, Epidemiologic reviews.

[17]  T. Blakely,et al.  Probabilistic record linkage and a method to calculate the positive predictive value. , 2002, International journal of epidemiology.

[18]  M. Goldacre,et al.  Computerised linking of medical records: methodological guidelines. , 1993, Journal of epidemiology and community health.

[19]  Murat Sariyar,et al.  Controlling false match rates in record linkage using extreme value theory , 2011, J. Biomed. Informatics.

[20]  P. Green Iteratively reweighted least squares for maximum likelihood estimation , 1984 .

[21]  Arie Hasman,et al.  Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage. , 2011, Journal of clinical epidemiology.