Probabilistic record linkage of de-identified research datasets with discrepancies using diagnosis codes

We develop an algorithm for probabilistic linkage of de-identified research datasets at the patient level when only diagnosis codes, themselves subject to discrepancies, are available and personal health identifiers such as name or date of birth are not. The algorithm relies on Bayesian modelling of binarized diagnosis codes and provides a posterior probability of matching for each patient pair while considering all the data at once. Both in our simulation study (which uses an administrative claims dataset for data generation) and in two real use-cases linking patient electronic health records from a large tertiary care network, our method performs well and compares favourably with the standard baseline Fellegi-Sunter algorithm. We provide a scalable, fast and efficient open-source implementation in the ludic R package, available on CRAN, which also includes the anonymized diagnosis code data from our real use-case. This work suggests that de-identified research databases stripped of any personal health identifiers can be linked using only diagnosis codes, provided sufficient information is shared between the data sources.
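To make the idea of a posterior matching probability computed from binarized diagnosis codes concrete, the following is a minimal Python sketch. It is not the algorithm implemented in ludic, which considers all the data at once; it is a simplified pairwise illustration under a naive conditional-independence assumption. The function name `match_posteriors`, the symmetric recording-error rate `eps`, the pooled prevalence estimate and the default prior are all assumptions made for this example.

```python
import math


def match_posteriors(A, B, eps=0.05, prior=None):
    """Posterior match probability for every (row of A, row of B) pair.

    A and B are lists of 0/1 lists (patients x diagnosis codes).
    Assumptions of this sketch: codes are conditionally independent, each
    recorded code is a true status flipped with probability eps, and each
    dataset holds more than one record (so the default prior is below 1).
    """
    n_codes = len(A[0])
    if prior is None:
        # Hypothetical default: each patient has roughly one match.
        prior = 1.0 / max(len(A), len(B))

    pooled = A + B
    weights = []
    for k in range(n_codes):
        # Prevalence of code k, estimated from pooled records and clipped.
        p = sum(row[k] for row in pooled) / len(pooled)
        p = min(max(p, 1e-3), 1 - 1e-3)
        # Marginal probability of recording a 1 under the error model.
        r = p * (1 - eps) + (1 - p) * eps
        # Joint probabilities of the two recordings for a true match.
        p11 = p * (1 - eps) ** 2 + (1 - p) * eps ** 2
        p00 = p * eps ** 2 + (1 - p) * (1 - eps) ** 2
        p10 = eps * (1 - eps)
        # Log-likelihood ratios: match versus two independent patients.
        disagree = math.log(p10 / (r * (1 - r)))
        weights.append({
            (1, 1): math.log(p11 / r ** 2),
            (0, 0): math.log(p00 / (1 - r) ** 2),
            (1, 0): disagree,
            (0, 1): disagree,
        })

    prior_logit = math.log(prior / (1 - prior))
    posteriors = []
    for a in A:
        row = []
        for b in B:
            logit = prior_logit + sum(
                w[(x, y)] for w, x, y in zip(weights, a, b)
            )
            # Numerically stable logistic function.
            row.append(0.5 * (1 + math.tanh(0.5 * logit)))
        posteriors.append(row)
    return posteriors


# Toy data: B repeats A's patients, with one discrepant code for patient 0.
A = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
]
B = [
    [1, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
]
post = match_posteriors(A, B)
```

In this toy setting, agreement on a rare code pushes the posterior up more than agreement on an absent code, and each discrepancy pulls it down, which is the intuition behind linking on diagnosis codes alone.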
