Bagging, bumping, multiview, and active learning for record linkage with empirical results on patient identity data

Record linkage, or deduplication, is the detection and removal of duplicate records within and across files. For this task, this paper introduces and evaluates two new machine-learning methods (bumping and multiview) together with bagging, a tree-based ensemble approach. Bumping is likewise tree-based, whereas multiview combines different methods and follows the semi-supervised learning principle. After providing the theoretical background of the methods, we give initial empirical results on patient identity data. In the empirical evaluation, we calibrate the methods on three different kinds of training data. The results show that the smallest training data set, which is obtained by a simple active learning strategy, leads to the best results. Multiview can outperform the other methods only when all are calibrated on a randomly sampled training set; in all other cases, it performs worse. The results of bumping do not differ significantly from those of bagging, the best-performing method overall. We cautiously conclude that tree-based record linkage methods are likely to produce similar results because of the low dimensionality (p≪n) and the straightforwardness of the underlying problem. Multiview may be better suited to more complex problems.
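
To make the difference between bagging and bumping concrete, the following Python sketch applies both to a handful of record-pair comparison vectors. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the one-split "stump" learner stands in for a full classification tree, and the toy data, the feature meanings, and all function names are hypothetical. Bagging keeps every bootstrap model and lets them vote; bumping keeps only the single bootstrap model that fits the original training data best.

```python
import random

def fit_stump(X, y):
    """Fit a one-split rule 'x[feature] >= threshold -> match' by exhaustive search."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            err = sum((row[j] >= t) != label for row, label in zip(X, y))
            if best is None or err < best[0]:
                best = (err, j, t)
    return best[1], best[2]

def predict_stump(stump, row):
    feature, threshold = stump
    return row[feature] >= threshold

def error(stump, X, y):
    """Misclassification rate of a stump on (X, y)."""
    return sum(predict_stump(stump, r) != l for r, l in zip(X, y)) / len(y)

def bootstrap(X, y, rng):
    """Draw len(X) record pairs with replacement."""
    idx = [rng.randrange(len(X)) for _ in X]
    return [X[i] for i in idx], [y[i] for i in idx]

def bagging(X, y, n_models, rng):
    """Bagging: keep all bootstrap models; prediction is by majority vote."""
    return [fit_stump(*bootstrap(X, y, rng)) for _ in range(n_models)]

def predict_bagging(models, row):
    votes = sum(predict_stump(m, row) for m in models)
    return 2 * votes > len(models)

def bumping(X, y, n_models, rng):
    """Bumping: fit on the original sample and on bootstrap samples,
    then keep the single model with the lowest error on the ORIGINAL data."""
    candidates = [fit_stump(X, y)]
    candidates += [fit_stump(*bootstrap(X, y, rng)) for _ in range(n_models)]
    return min(candidates, key=lambda m: error(m, X, y))

# Hypothetical comparison vectors: (name similarity, birth date agreement).
X = [(0.95, 1.0), (0.90, 1.0), (0.85, 0.0), (0.30, 1.0), (0.20, 0.0), (0.10, 0.0)]
y = [True, True, True, False, False, False]  # True = the pair is a duplicate

rng = random.Random(0)
models = bagging(X, y, n_models=25, rng=rng)
bumped = bumping(X, y, n_models=25, rng=rng)

print("bagging vote for a perfectly agreeing pair:", predict_bagging(models, (1.0, 1.0)))
print("bumping training error:", error(bumped, X, y))
```

Because the stump here is fitted by exhaustive search, the fit on the original sample is already optimal among the candidates, so bumping cannot improve on it in this toy setting. Any benefit of bumping would arise with greedily grown multi-level trees, where a perturbed sample can lead the search to a better split sequence.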
