A benchmark comparison of deterministic and probabilistic methods for defining manual review datasets in duplicate records reconciliation

INTRODUCTION Clinical databases require accurate entity resolution (ER). One approach is to use algorithms that assign questionable cases to manual review. Few studies have compared the performance of common algorithms for this task, and previous work has been limited by a lack of objective methods for setting algorithm parameters. We compared the performance of common ER algorithms, using algorithmic optimization rather than manual parameter tuning, on both two-threshold classification (match/manual review/non-match) and single-threshold classification (match/non-match).

METHODS We manually reviewed 20,000 randomly selected potential duplicate record-pairs to identify matches (10,000 training set, 10,000 test set). We evaluated the probabilistic expectation maximization, simple deterministic, and fuzzy inference engine (FIE) algorithms. We used particle swarm optimization to set algorithm parameters for single-threshold and for two-threshold classification. We ran 10 iterations of optimization on the training set and report averaged performance against the test set.

RESULTS The overall estimated duplicate rate was 6%. The FIE and simple deterministic algorithms required a smaller manual review set than the probabilistic method (FIE 1.9%, simple deterministic 2.5%, probabilistic 3.6%; p<0.001). For a single threshold, the simple deterministic algorithm performed better than the probabilistic method (positive predictive value 0.956 vs 0.887, sensitivity 0.985 vs 0.887, p<0.001). ER with FIE classified 98.1% of record-pairs correctly (error rate 1/10,000), assigning the remainder to manual review.

CONCLUSIONS Optimized deterministic algorithms outperform the probabilistic method. There is a strong case for considering optimized deterministic methods for ER.
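The difference between two-threshold and single-threshold classification can be illustrated with a minimal Python sketch. It assumes each record-pair has already been reduced to a single similarity score in which higher means more likely a duplicate; the function names, the score scale, and the example values are hypothetical and are not taken from the study.

```python
from typing import Dict, Sequence

MATCH, REVIEW, NON_MATCH = "match", "manual review", "non-match"


def classify(score: float, lower: float, upper: float) -> str:
    """Assign one record-pair to a class using two thresholds.

    Scores at or above `upper` are matches, scores below `lower` are
    non-matches, and everything in between goes to manual review.
    Setting lower == upper gives single-threshold classification.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return MATCH
    if score < lower:
        return NON_MATCH
    return REVIEW


def summarize(scores: Sequence[float], is_duplicate: Sequence[bool],
              lower: float, upper: float) -> Dict[str, float]:
    """Return the fraction of pairs sent to review and the fraction
    classified automatically but incorrectly."""
    decisions = [classify(s, lower, upper) for s in scores]
    n = len(decisions)
    review = sum(d == REVIEW for d in decisions)
    errors = sum(
        (d == MATCH and not truth) or (d == NON_MATCH and truth)
        for d, truth in zip(decisions, is_duplicate)
    )
    return {"review_fraction": review / n, "error_fraction": errors / n}


# Hypothetical scores and reviewer labels, for illustration only.
scores = [0.95, 0.80, 0.55, 0.40, 0.10]
labels = [True, True, False, True, False]

two_threshold = summarize(scores, labels, lower=0.3, upper=0.7)
single_threshold = summarize(scores, labels, lower=0.5, upper=0.5)
print(two_threshold)
print(single_threshold)
```

In this toy example the two-threshold setting sends the two ambiguous pairs to manual review and makes no automatic errors, while the single threshold reviews nothing and misclassifies both. An optimizer such as particle swarm would search over `lower` and `upper` (and any scoring parameters) to trade off these two quantities.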
