Probabilistic record linkage

Abstract: Studies that use probabilistic record linkage are becoming increasingly common. However, the methods underpinning probabilistic record linkage are not widely taught or understood, so these studies can appear to rely on a ‘black box’ research tool. In this article, we describe the process of probabilistic record linkage through a simple exemplar. We first introduce the concept of deterministic linkage and contrast it with probabilistic linkage. We then illustrate each step of the process using the exemplar and describe the data structure required to perform a probabilistic linkage. We describe how to calculate and interpret match weights and how to convert match weights into posterior probabilities of a match using Bayes' theorem. We conclude with a brief discussion of some of the computational demands of record linkage, how you might assess the quality of your linkage algorithm, and how epidemiologists can maximize the value of their record-linked research using robust record linkage methods.
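The weight calculation and Bayes conversion mentioned in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's own implementation: it assumes the standard Fellegi-Sunter form, in which each field has an m-probability (chance of agreement among true matches) and a u-probability (chance of agreement among non-matches), fields are treated as conditionally independent, and weights are on a log base 2 scale. The function names, field names, and all numeric values are hypothetical.

```python
import math


def field_weight(m, u, agrees):
    """Log2 likelihood ratio contributed by one field.

    m: P(field agrees | records are a true match)
    u: P(field agrees | records are not a match)
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(m_probs, u_probs, agreement):
    """Sum field weights, assuming fields are conditionally independent."""
    return sum(
        field_weight(m_probs[field], u_probs[field], agrees)
        for field, agrees in agreement.items()
    )


def posterior_probability(weight, prior):
    """Convert a total log2 match weight into P(match | agreement pattern).

    Bayes' theorem in odds form: posterior odds = prior odds * 2 ** weight.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


# Hypothetical m- and u-probabilities for two illustrative fields.
m_probs = {"surname": 0.95, "birth_year": 0.90}
u_probs = {"surname": 0.01, "birth_year": 0.05}

# One candidate record pair: surname agrees, birth year disagrees.
agreement = {"surname": True, "birth_year": False}

weight = match_weight(m_probs, u_probs, agreement)
# Hypothetical prior: 1 in 1000 candidate pairs is a true match.
probability = posterior_probability(weight, prior=0.001)
print(f"match weight = {weight:.2f}, posterior probability = {probability:.4f}")
```

The prior matters as much as the weight: the same agreement pattern yields a very different posterior probability when true matches are rare among the candidate pairs than when they are common.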
