Comparing record linkage software programs and algorithms using real-world data

Linkage of medical databases, including insurer claims and electronic health records (EHRs), is increasingly common. However, few studies have investigated the behavior and output of linkage software. To determine how linkage quality is affected by different algorithms, blocking variables, methods for string matching and weight determination, and decision rules, we compared the performance of 4 nonproprietary linkage software packages linking patient identifiers from noninteroperable inpatient and outpatient EHRs. We linked datasets using first and last name, gender, and date of birth (DOB). We evaluated DOB and year of birth (YOB) as blocking variables and used both exact and inexact string matching. We compared the weights assigned to record pairs and evaluated how well matching weights corresponded to a gold standard, the medical record number. After deduplication, the datasets contained 69,523 inpatient and 176,154 outpatient records. In linkage runs blocking on DOB, the number of weights produced ranged from 8 for exact matching to 64,273 for inexact matching; runs blocking on YOB produced 8 to 916,806 weights. Exact matching yielded identical test characteristics (sensitivity 90.48%, specificity 99.78%) for the highest ranked group of record pairs, although algorithms prioritized certain variables differently. Inexact matching behaved more variably, leading to dramatic differences in sensitivity (range 0.04–93.36%) and positive predictive value (PPV) (range 86.67–97.35%), even for the most highly ranked record pairs. Blocking on DOB led to higher PPV among highly ranked record pairs. An ensemble approach based on averaging scaled matching weights led to modestly improved accuracy. In summary, we found few differences in the rankings of record pairs with the highest matching weights across 4 linkage packages. Performance was more consistent for exact string matching than for inexact string matching.
Most methods and software packages performed similarly when comparing matching accuracy with the gold standard. In some settings, an ensemble matching approach may outperform individual linkage algorithms.
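
The ensemble idea mentioned above (averaging scaled matching weights across packages, then checking the result against a gold standard) can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the abstract does not say how weights were scaled, so min-max scaling is assumed here; treating a pair that a package did not score as 0 is also an assumption; and all function names, record identifiers, and toy weights are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (inpatient record id, outpatient record id); hypothetical ids


def min_max_scale(weights: Dict[Pair, float]) -> Dict[Pair, float]:
    """Rescale one package's matching weights to [0, 1] (assumed scaling method)."""
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:
        # All weights identical: no ranking information, map everything to 1.0.
        return {pair: 1.0 for pair in weights}
    return {pair: (w - lo) / (hi - lo) for pair, w in weights.items()}


def ensemble_weights(packages: List[Dict[Pair, float]]) -> Dict[Pair, float]:
    """Average scaled weights across packages.

    Assumption: a pair that a package did not score contributes 0 for that package.
    """
    scaled = [min_max_scale(p) for p in packages]
    all_pairs = set().union(*scaled)
    return {
        pair: sum(s.get(pair, 0.0) for s in scaled) / len(scaled)
        for pair in all_pairs
    }


def sensitivity_and_ppv(
    scores: Dict[Pair, float], true_matches: Set[Pair], threshold: float
) -> Tuple[float, float]:
    """Compare pairs scoring at or above the threshold with gold-standard matches."""
    predicted = {pair for pair, s in scores.items() if s >= threshold}
    true_positives = len(predicted & true_matches)
    sensitivity = true_positives / len(true_matches) if true_matches else 0.0
    ppv = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, ppv


# Toy weights from two hypothetical packages that use different weight scales.
package_a = {("i1", "o1"): 20.0, ("i2", "o2"): 10.0, ("i3", "o9"): 0.0}
package_b = {("i1", "o1"): 1.0, ("i2", "o2"): 0.5, ("i3", "o9"): 0.5}

# Gold standard: pairs that share a medical record number (toy data).
true_matches = {("i1", "o1"), ("i2", "o2")}

ensemble = ensemble_weights([package_a, package_b])
strict = sensitivity_and_ppv(ensemble, true_matches, threshold=0.5)
lenient = sensitivity_and_ppv(ensemble, true_matches, threshold=0.2)
```

Scaling before averaging matters because packages report weights on different numeric ranges; without it, the package with the widest range would dominate the average. Lowering the threshold in this toy example trades nothing in PPV for higher sensitivity, which will not generally hold on real data.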
