Improving record linkage performance in the presence of missing linkage data

INTRODUCTION Existing record linkage methods do not handle missing linking field values efficiently or effectively. The objective of this study is to investigate three novel methods for improving the accuracy and efficiency of record linkage when linkage fields have missing values.

METHODS By extending the Fellegi-Sunter scoring implementations available in the open-source Fine-grained Record Linkage (FRIL) software system, we developed three novel methods to address the missing data problem in record linkage: Weight Redistribution, Distance Imputation, and Linkage Expansion. Weight Redistribution removes fields with missing data from the set of quasi-identifiers and redistributes the weight of each missing attribute across the remaining available linkage fields based on their relative proportions. Distance Imputation imputes the distance between the missing data fields rather than imputing the missing data value itself. Linkage Expansion adds fields previously treated as non-linkage fields to the linkage field set to compensate for the missing information in a linkage field. We tested the linkage methods using simulated data sets with varying field value corruption rates.

RESULTS In data sets with low corruption rates, the methods had sensitivity ranging from 0.895 to 0.992 and positive predictive values (PPV) ranging from 0.865 to 1.000. Increased corruption rates led to decreased sensitivity for all methods.

CONCLUSIONS These new record linkage algorithms show promise in terms of accuracy and efficiency and may be valuable for combining large data sets at the patient level to support biomedical and clinical research.
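As an illustration of the Weight Redistribution idea described in the abstract, the following is a minimal sketch and not the FRIL implementation. It assumes that each linkage field carries a numeric weight and that the weight of a missing field is shared among the remaining fields in proportion to their existing weights. The function name `redistribute_weights` and the field names are hypothetical.

```python
def redistribute_weights(weights, missing_fields):
    """Drop missing fields and rescale the remaining weights.

    Assumption (not taken from the paper): the weight of each missing
    field is shared among the available fields in proportion to their
    current weights, so the total weight is preserved.
    """
    available = {f: w for f, w in weights.items() if f not in missing_fields}
    available_total = sum(available.values())
    if available_total == 0:
        raise ValueError("no linkage fields with usable weight remain")
    # Scaling every remaining weight by the same factor keeps their
    # relative proportions and restores the original total.
    scale = sum(weights.values()) / available_total
    return {f: w * scale for f, w in available.items()}


# Hypothetical field weights for one record pair in which "dob" is missing.
field_weights = {"first_name": 0.3, "last_name": 0.3, "dob": 0.4}
adjusted = redistribute_weights(field_weights, {"dob"})
```

In this example the two name fields have equal weights, so they split the weight of the missing date of birth evenly and the adjusted weights again sum to the original total.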
