Hierarchical Bayesian Record Linkage Theory

Record linkage, also called exact file matching, compares two or more files covering a single population, either for unduplication or to produce an enhanced, merged database. Its many applications include population enumeration, the creation of databases for epidemiological investigations, and the improvement of survey sample frames. Latent class and mixture models have been used to implement computerised record linkage of large databases. In these models, each pair of records, one record from each of two files, either pertains to the same person (a match) or to different people (a nonmatch), and the probability of each outcome is estimated from the model parameters via Bayes’ theorem. In some settings, experience with similar record linkage operations can inform prior opinions about the model parameters. In this paper, Bayesian record linkage alternatives are developed and compared through simulation. A hierarchical Bayesian model allows parameters to vary across file blocks, which are similar to geographical blocks in census applications. Techniques are presented for incorporating one-to-one matching between files into the likelihood itself and for computing posterior distributions of parameters and linkage indicators.
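As a rough illustration of how Bayes' theorem turns mixture-model parameters into a match probability for a single record pair, the sketch below assumes a simple two-class model with conditionally independent binary field comparisons. The function name, argument names, and numeric values are hypothetical and chosen only for illustration; this is not the hierarchical model developed in the paper, and it ignores blocking and one-to-one matching constraints.

```python
from math import prod


def match_probability(agreements, m, u, p_match):
    """Posterior probability that a record pair is a match.

    Assumes (for illustration only) that field comparisons are
    conditionally independent given match status.

    agreements: sequence of 0/1 flags, one per comparison field
    m[k]: P(field k agrees | match)
    u[k]: P(field k agrees | nonmatch)
    p_match: prior proportion of matches among candidate pairs
    """
    # Likelihood of the observed agreement pattern under each class
    like_match = prod(mk if a else 1 - mk for a, mk in zip(agreements, m))
    like_non = prod(uk if a else 1 - uk for a, uk in zip(agreements, u))
    # Bayes' theorem: prior times likelihood, normalised over both classes
    numerator = p_match * like_match
    return numerator / (numerator + (1 - p_match) * like_non)


# Illustrative values: two fields (say surname and birth year), both agreeing
posterior = match_probability([1, 1], m=[0.9, 0.9], u=[0.1, 0.1], p_match=0.5)
```

In a fully Bayesian treatment, `m`, `u`, and `p_match` would themselves carry prior distributions and could vary by block, rather than being fixed numbers as they are here.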
