A Bayesian Approach for De-duplication in the Presence of Relational Data

In this paper we study the impact of combining profile and network data in a de-duplication setting. We also assess the influence of a range of prior distributions on the linkage structure, including one that we propose. Our proposed prior makes it straightforward to specify prior beliefs and naturally enforces the microclustering property. Furthermore, we explore stochastic gradient Hamiltonian Monte Carlo methods as a faster alternative for obtaining samples of the network parameters. Our methodology is evaluated on RLdata500, a popular dataset in the record linkage literature.
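To make the stochastic gradient Hamiltonian Monte Carlo idea concrete, the following is a minimal sketch of the standard update with a friction term, applied to a toy problem rather than to the network model studied in the paper. Everything in it is an assumption made for illustration: the target (the posterior of a Gaussian mean with unit variance and a flat prior), the function name `sghmc_gaussian_mean`, the step size `eta`, the friction `alpha`, and the choice to set the gradient-noise estimate to zero.

```python
import random


def sghmc_gaussian_mean(data, n_steps=5000, batch_size=100,
                        eta=1e-5, alpha=0.1, seed=0):
    """Toy SGHMC sampler for the mean of unit-variance Gaussian data.

    Hypothetical illustration only: flat prior, known unit variance,
    and the gradient-noise estimate is taken to be zero.
    """
    rng = random.Random(seed)
    n = len(data)
    theta, v = 0.0, 0.0
    # Standard deviation of the injected noise, N(0, 2 * alpha * eta).
    noise_sd = (2.0 * alpha * eta) ** 0.5
    samples = []
    for _ in range(n_steps):
        # Position update using the current momentum-like variable.
        theta += v
        # Minibatch estimate of the gradient of the negative log posterior,
        # rescaled by n / batch_size so that it is unbiased.
        batch = rng.sample(data, batch_size)
        grad_u = -(n / batch_size) * sum(x - theta for x in batch)
        # Momentum update: friction, noisy gradient step, injected noise.
        v = (1.0 - alpha) * v - eta * grad_u + rng.gauss(0.0, noise_sd)
        samples.append(theta)
    return samples


# Simulated toy data (not RLdata500): 1000 draws centered at 2.
data_rng = random.Random(1)
toy_data = [data_rng.gauss(2.0, 1.0) for _ in range(1000)]

draws = sghmc_gaussian_mean(toy_data)
kept = draws[1000:]  # discard an initial burn-in period
posterior_mean_estimate = sum(kept) / len(kept)
data_mean = sum(toy_data) / len(toy_data)
```

Each iteration touches only `batch_size` observations, which is the source of the speed-up over full-gradient Hamiltonian Monte Carlo; the friction term `alpha` counteracts the extra noise that the minibatch gradient introduces. In this toy setting `posterior_mean_estimate` should land close to `data_mean`, the exact posterior mean under the flat prior.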
