Regression analysis with linked data: problems and possible solutions

In this paper we have described and extended some recent proposals for a general Bayesian methodology for performing record linkage and making inference from the resulting matched units. In particular, we have framed the record linkage process within a formal statistical model that comprises both the matching variables and the other variables used at the inferential stage. In this way, the researcher can account for the uncertainty of the matching process in inferential procedures based on probabilistically linked data and, at the same time, let information propagate back and forth between the working statistical model and the record linkage stage. We have argued that this feedback effect is essential to eliminate potential biases that would otherwise affect inference from linked data, and that it can also improve record linkage performance. The practical implementation of the procedure relies on standard Bayesian computational techniques, such as Markov chain Monte Carlo algorithms. Although the methodology is quite general, we have restricted our analysis to the popular and important case of multiple linear regression for expository convenience.
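The bias that motivates the approach can be seen in a small simulation. The sketch below is not the Bayesian procedure described above; it is only an illustration, under assumed settings, of what happens when a linked file containing false matches is analysed as if every link were correct. The function names (`ols_slope`, `simulate`), the sample size, the true slope of 2 and the 30% mismatch rate are all hypothetical choices made for this example.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x for a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=5000, beta=2.0, mismatch_rate=0.3, seed=0):
    """Return (slope with perfect links, slope with some false matches).

    All settings are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    # File A holds the covariate x, file B holds the response y.
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [1.0 + beta * xi + rng.gauss(0.0, 1.0) for xi in x]

    # Pick a subset of records and shuffle their responses among themselves,
    # which mimics links that point to the wrong unit in file B.
    k = int(mismatch_rate * n)
    wrong = rng.sample(range(n), k)
    shuffled = wrong[:]
    rng.shuffle(shuffled)
    y_linked = y[:]
    for i, j in zip(wrong, shuffled):
        y_linked[i] = y[j]

    return ols_slope(x, y), ols_slope(x, y_linked)


slope_true, slope_linked = simulate()
print(f"slope with correct links:   {slope_true:.3f}")
print(f"slope with 30% false links: {slope_linked:.3f}")
```

With mismatches that are unrelated to the covariate, as assumed here, the naive slope is pulled toward zero by a factor of roughly one minus the mismatch rate. A model that treats the linkage status as unknown, as the methodology above does, is one way to avoid this attenuation.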
