Active learning strategies for the deduplication of electronic patient data using classification trees

INTRODUCTION Supervised record linkage methods often require clerical review to obtain informative training data. Active learning prompts the user to label data items with particular characteristics in order to minimise review costs. We conducted an empirical evaluation to investigate whether a simple active learning strategy using binary comparison patterns is sufficient, or whether string metrics combined with a more sophisticated algorithm are necessary to achieve high accuracy with a small training set.

MATERIAL AND METHODS Based on medical registry data with different numbers of attributes, we used active learning to acquire training sets for classification trees, which were then used to classify the remaining data. For binary patterns, active learning treats every distinct comparison pattern as a stratum from which one item is sampled. For patterns consisting of Levenshtein string metric values, active learning uses an iterative process in which the most informative and representative examples are added to the training set. For the latter, we extended the active learning strategy of Sarawagi and Bhamidipaty (2002).

RESULTS On the original data set, active learning based on binary comparison patterns gives the best results. When four or six attributes are dropped, string metrics give better results. In both cases, no more than 200 manually reviewed training examples are necessary.

CONCLUSIONS In record linkage applications where only forename, name and birthday are available as attributes, we suggest the sophisticated active learning strategy based on string metrics in order to achieve highly accurate results. We recommend the simple strategy if more attributes are available, as in our study. In both cases, active learning significantly reduces the amount of manual involvement in training data selection compared to usual record linkage settings.
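The simple strategy described under MATERIAL AND METHODS (one sampled record pair per distinct binary comparison pattern) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the attribute names and the use of exact equality as the agreement test are assumptions made for illustration. A plain dynamic-programming Levenshtein distance is included only to show the kind of value a string-metric pattern would contain; the iterative selection of informative and representative examples is not shown.

```python
import random
from collections import defaultdict

# Hypothetical attribute names; the registry data in the study has more attributes.
ATTRIBUTES = ("forename", "name", "birthday")


def binary_pattern(rec_a, rec_b, attributes=ATTRIBUTES):
    """Return a tuple of 0/1 flags, 1 where both records agree exactly (assumed test)."""
    return tuple(int(rec_a[attr] == rec_b[attr]) for attr in attributes)


def sample_one_per_pattern(pairs, attributes=ATTRIBUTES, seed=0):
    """Treat each distinct binary pattern as a stratum and draw one pair from it."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec_a, rec_b in pairs:
        strata[binary_pattern(rec_a, rec_b, attributes)].append((rec_a, rec_b))
    # One pair per stratum would go to clerical review.
    return {pattern: rng.choice(members) for pattern, members in strata.items()}


def levenshtein(s, t):
    """Edit distance with unit costs for insertion, deletion and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution or match
        prev = cur
    return prev[-1]
```

In such a sketch, the pairs returned by `sample_one_per_pattern` would be labelled by a reviewer and used as the training set for a classification tree, which would then classify the remaining pairs.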

[1]  Leo Breiman, et al. Bagging Predictors, 1996, Machine Learning.

[2]  Ahmed K. Elmagarmid, et al. Automating the approximate record-matching process, 2000, Inf. Sci.

[3]  Anuradha Bhamidipaty, et al. Interactive deduplication using active learning, 2002, KDD.

[4]  Panagiotis G. Ipeirotis, et al. Duplicate Record Detection: A Survey, 2007.

[5]  Murat Sariyar, et al. Controlling false match rates in record linkage using extreme value theory, 2011, J. Biomed. Informatics.

[6]  Burr Settles, et al. Active Learning Literature Survey, 2009.

[7]  Craig A. Knoblock, et al. Learning domain-independent string transformation weights for high accuracy object identification, 2002, KDD.

[8]  Peter N. Yianilos, et al. Learning String-Edit Distance, 1996, IEEE Trans. Pattern Anal. Mach. Intell.

[9]  Gonzalo Navarro, et al. A guided tour to approximate string matching, 2001, CSUR.

[10]  Murat Sariyar, et al. Missing values in deduplication of electronic patient data, 2012, J. Am. Medical Informatics Assoc.

[11]  Matthias Egger, et al. The Swiss National Cohort: a unique database for national and international researchers, 2010, International Journal of Public Health.

[12]  Xiaojin Zhu, et al. --1 CONTENTS, 2006.

[13]  Xiaojin Zhu, et al. Introduction to Semi-Supervised Learning, 2009, Synthesis Lectures on Artificial Intelligence and Machine Learning.

[14]  William E. Yancey. Evaluating String Comparator Performance for Record Linkage, 2005.

[15]  Dennis Shasha, et al. Efficient data reconciliation, 2001, Inf. Sci.

[16]  A note on the distribution of the Wilcoxon rank sum statistic, 1992.

[17]  Wei-Yin Loh, et al. Classification and regression trees, 2011, WIREs Data Mining Knowl. Discov.

[18]  Gisele L. Pappa, et al. Active Learning Genetic programming for record deduplication, 2010, IEEE Congress on Evolutionary Computation.

[19]  Murat Sariyar, et al. The RecordLinkage Package: Detecting Errors in Data, 2010, R J.

[20]  Raymond J. Mooney, et al. Adaptive duplicate detection using learnable string similarity measures, 2003, KDD '03.

[21]  Raghav Kaushik, et al. On active learning of record matching packages, 2010, SIGMOD Conference.

[22]  Craig A. Knoblock, et al. Learning object identification rules for information integration, 2001, Inf. Syst.

[23]  Murat Sariyar, et al. Evaluation of Record Linkage Methods for Iterative Insertions, 2009, Methods of Information in Medicine.

[24]  Ivan P. Fellegi, et al. A Theory for Record Linkage, 1969.

[25]  Thanaa M. Ghanem, et al. Record Linkage: A Machine Learning Approach, A Toolbox, and a Digital Government Web Service, 2003.

[26]  Craig A. Knoblock, et al. Automatically Utilizing Secondary Sources to Align Information Across Sources, 2005, AI Mag.