Active Learning for Probabilistic Record Linkage

Integrating information from multiple sources plays a key role in social science research. However, when a unique identifier that unambiguously links records is not available, merging datasets can be a difficult and error-prone endeavor. Probabilistic record linkage (PRL) aims to solve this problem by providing a framework in which common variables between datasets are used as potential identifiers, with the goal of producing a probabilistic estimate for the unobserved matching status across records. In this paper, I propose an active learning algorithm for PRL, which efficiently incorporates human judgement into the process and significantly improves PRL’s performance at the cost of manually labelling a small number of records. Using data from local politicians in Brazil, where a unique identifier is available for validation, I find that the proposed method bolsters the overall accuracy of the merging process. In addition, I examine data from a recent vote validation study conducted for the ANES, and I show that the proposed method can recover estimates that are indistinguishable from those obtained from a more extensive, expensive, and time-consuming clerical review.
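To make the two ideas in the abstract concrete, the sketch below computes a posterior match probability from field-level agreement patterns and then picks the record pair whose status is most uncertain, which is the pair a human would be asked to label. It is a minimal illustration and not the paper's algorithm: it assumes a Fellegi-Sunter style model with conditionally independent fields and plain uncertainty sampling. The function names (`match_probability`, `most_uncertain`) and all parameter values are hypothetical, chosen only for illustration.

```python
from math import prod


def match_probability(agreement, m, u, prior):
    """Posterior probability that a record pair is a match.

    agreement: tuple of 0/1 flags, one per linkage field (1 = fields agree)
    m[k]: P(field k agrees | pair is a match)        -- assumed known here
    u[k]: P(field k agrees | pair is a non-match)    -- assumed known here
    prior: marginal probability that a random pair is a match
    Assumes fields are conditionally independent given match status.
    """
    like_match = prod(mk if a else 1 - mk for a, mk in zip(agreement, m))
    like_non = prod(uk if a else 1 - uk for a, uk in zip(agreement, u))
    num = prior * like_match
    return num / (num + (1 - prior) * like_non)


def most_uncertain(pairs, m, u, prior, n=1):
    """Return the n pairs whose posterior is closest to 0.5.

    This is plain uncertainty sampling: these are the pairs a human
    labeller would be asked to review first.
    """
    return sorted(
        pairs, key=lambda g: abs(match_probability(g, m, u, prior) - 0.5)
    )[:n]


# Illustrative (made-up) parameters for three fields, e.g. name, birth year, city.
m = [0.9, 0.8, 0.95]
u = [0.1, 0.2, 0.05]
prior = 0.1
pairs = [(1, 1, 1), (0, 1, 1), (0, 0, 0), (1, 1, 0)]

for g in pairs:
    print(g, round(match_probability(g, m, u, prior), 3))
print("ask a human about:", most_uncertain(pairs, m, u, prior))
```

In a full active learning loop, the human label for the selected pair would be fed back into the model and the parameters re-estimated before the next query. That step is omitted here because the abstract does not specify how it is done.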
