Extending the Fellegi-Sunter probabilistic record linkage method for approximate field comparators

Probabilistic record linkage is commonly used to determine whether demographic records refer to the same person. The Fellegi-Sunter method is a probabilistic approach that scores record similarity using field weights derived from log likelihood ratios. This paper introduces an extension of the Fellegi-Sunter method that incorporates approximate field comparators into the calculation of field weights. The data warehouse of a large academic medical center served as a case study. Using different sets of demographic fields and different matching cutoffs, the approximate comparator extension was compared with the Fellegi-Sunter method in its ability to find duplicate records that had previously been identified in the data warehouse. The approximate comparator extension misclassified 25% fewer pairs and had a larger Welch's t statistic than the Fellegi-Sunter method for all field sets and matching cutoffs. The accuracy gain from the extension grew as less information was provided and as the matching cutoff increased. Given the ubiquity of linkage in both clinical and research settings, this incremental improvement has the potential to make a considerable impact.
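To make the idea of field weights concrete, the following minimal sketch contrasts an exact-match Fellegi-Sunter field weight with one possible way of folding an approximate comparator into that weight. It is an illustration under stated assumptions, not the formulation used in the paper: the abstract does not give the extension's formula, so the linear interpolation between the disagreement and agreement weights, the normalized edit-distance similarity, and the example m and u probabilities (0.9 and 0.1) are all hypothetical choices.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def agreement_weight(m: float, u: float) -> float:
    """Log likelihood ratio when the field agrees: log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Log likelihood ratio when the field disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


def exact_field_weight(a: str, b: str, m: float, u: float) -> float:
    """Classic binary comparison: full agreement weight or full disagreement weight."""
    return agreement_weight(m, u) if a == b else disagreement_weight(m, u)


def approx_field_weight(a: str, b: str, m: float, u: float) -> float:
    """ASSUMPTION: interpolate linearly between the two weights by similarity.

    This is one hypothetical way to use an approximate comparator; the
    paper's actual weighting scheme may differ.
    """
    w_agree = agreement_weight(m, u)
    w_disagree = disagreement_weight(m, u)
    return w_disagree + similarity(a, b) * (w_agree - w_disagree)


if __name__ == "__main__":
    m, u = 0.9, 0.1  # hypothetical m- and u-probabilities for a surname field
    print(exact_field_weight("smith", "smyth", m, u))
    print(approx_field_weight("smith", "smyth", m, u))
```

Under these assumptions, a near-miss such as "smith" versus "smyth" receives the full disagreement penalty from the exact comparison but a partial positive weight from the approximate one, which is the kind of difference that could change which side of a matching cutoff a record pair falls on.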
