Bayesian Estimation of Bipartite Matchings for Record Linkage

ABSTRACT The bipartite record linkage task consists of merging two disparate datafiles containing information on two overlapping sets of entities. This is nontrivial in the absence of unique identifiers, and it matters for a wide variety of applications because the task arises whenever information from different sources has to be combined. Most statistical techniques currently used for record linkage are derived from a seminal article by Fellegi and Sunter (1969). These techniques usually assume independence in the matching statuses of record pairs to derive estimation procedures and optimal point estimators. We argue that this independence assumption is unreasonable and instead target a bipartite matching between the two datafiles as our parameter of interest. Bayesian implementations allow us to quantify uncertainty on the matching decisions and derive a variety of point estimators using different loss functions. We propose partial Bayes estimates that allow uncertain parts of the bipartite matching to be left unresolved. We evaluate our approach to record linkage using a variety of challenging scenarios and show that it outperforms the traditional methodology. We illustrate the advantages of our methods by merging two datafiles on casualties from the civil war of El Salvador. Supplementary materials for this article are available online.
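
The idea of a partial estimate that leaves uncertain links unresolved can be illustrated with a minimal sketch. It assumes posterior draws of the bipartite matching are already available as labelings, where each record in the second datafile carries either the index of its match in the first datafile or `None` for no match. The function name `partial_estimate`, the `UNRESOLVED` marker, and the thresholding rule are hypothetical choices made for illustration; the article's own loss functions and estimators are not reproduced here.

```python
from collections import Counter

UNRESOLVED = "unresolved"  # hypothetical marker for a decision left open


def partial_estimate(samples, threshold=0.5):
    """Threshold-based partial point estimate from posterior draws.

    samples[s][j] is the file-1 index linked to file-2 record j in draw s,
    or None when record j has no match in that draw. Each draw is assumed
    to be a valid bipartite matching (no file-1 index used twice).
    """
    n_draws = len(samples)
    n_records = len(samples[0])
    estimate = []
    for j in range(n_records):
        # Empirical posterior distribution of the label of record j.
        counts = Counter(draw[j] for draw in samples)
        label, count = counts.most_common(1)[0]
        prob = count / n_draws
        # Commit only when the most probable label is strictly above the
        # threshold; otherwise leave the record unresolved.
        estimate.append(label if prob > threshold else UNRESOLVED)
    return estimate


# Four illustrative draws for three records in the second datafile.
samples = [
    [0, None, 2],
    [0, None, 1],
    [0, 1, 2],
    [0, None, None],
]
print(partial_estimate(samples))  # [0, None, 'unresolved']
```

With a strict threshold of at least 0.5, two records cannot both be assigned the same match: each would need that label in more than half of the draws, so some draw would contain both, which a bipartite matching rules out. Lower thresholds resolve more records but lose this guarantee.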
