Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records

Since most social science research relies on multiple data sources, merging data sets is an essential part of researchers’ workflow. Unfortunately, a unique identifier that unambiguously links records is often unavailable, and the data may contain missing and inaccurate information. These problems are especially severe when merging large-scale administrative records. We develop a fast and scalable algorithm to implement a canonical model of probabilistic record linkage that has many advantages over the deterministic methods frequently used by social scientists. The proposed methodology efficiently handles millions of observations while accounting for missing data and measurement error, incorporating auxiliary information, and adjusting for uncertainty about merging in post-merge analyses. We conduct comprehensive simulation studies to evaluate the performance of our algorithm in realistic scenarios. We also apply our methodology to merging campaign contribution records, survey data, and nationwide voter files. An open-source software package is available for implementing the proposed methodology.
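To make the idea of probabilistic record linkage concrete, here is a minimal sketch of a two-class mixture model in the Fellegi-Sunter style, fitted by EM. It is an illustration under stated assumptions, not the paper's algorithm or the interface of its software package: agreement on each field is binary, fields are conditionally independent given match status, and a missing comparison is simply skipped (treated as ignorable). The function name `fs_em`, its parameters, and the starting values are all hypothetical.

```python
def fs_em(gamma, n_iter=200, tol=1e-8, eps=1e-6):
    """EM for a two-class mixture over agreement vectors (illustrative only).

    gamma: list of tuples, one per record pair, with entries
           1 (fields agree), 0 (fields disagree), or None (comparison missing).
    Returns (lam, m, u, post):
      lam  - estimated share of pairs that are matches
      m[j] - P(field j agrees | match)
      u[j] - P(field j agrees | non-match)
      post - posterior match probability per pair, from the last E-step
    """
    k = len(gamma[0])
    # Assumed starting values; they also fix which class is labeled "match".
    lam, m, u = 0.1, [0.9] * k, [0.1] * k

    def clamp(p):
        # Keep probabilities away from 0 and 1 for numerical stability.
        return min(max(p, eps), 1.0 - eps)

    post = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        post = []
        for g in gamma:
            pm, pu = lam, 1.0 - lam
            for j, x in enumerate(g):
                if x is None:
                    continue  # a missing comparison contributes no evidence
                pm *= m[j] if x else 1.0 - m[j]
                pu *= u[j] if x else 1.0 - u[j]
            post.append(pm / (pm + pu))

        # M-step: update parameters using only the observed comparisons.
        new_lam = clamp(sum(post) / len(post))
        for j in range(k):
            obs = [(p, g[j]) for p, g in zip(post, gamma) if g[j] is not None]
            wm = sum(p for p, _ in obs)
            wu = sum(1.0 - p for p, _ in obs)
            if wm > 0:
                m[j] = clamp(sum(p * x for p, x in obs) / wm)
            if wu > 0:
                u[j] = clamp(sum((1.0 - p) * x for p, x in obs) / wu)

        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam, m, u, post


# Toy comparison data for three fields (e.g. name, birth date, address).
pairs = ([(1, 1, 1)] * 20 + [(0, 0, 0)] * 180 + [(1, 0, 0)] * 20
         + [(1, 1, 0)] * 5 + [(1, None, 1)] * 3)
lam, m, u, post = fs_em(pairs)
```

The posterior probabilities in `post` are what a downstream analysis could use as weights to carry merge uncertainty forward, rather than treating every declared match as certain. A scalable implementation would work on counts of distinct agreement patterns instead of looping over every pair, which this sketch does not attempt.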
