Theoretical limits of microclustering for record linkage

There has been substantial recent interest in record linkage, where one attempts to group the records pertaining to the same entities from one or more large databases that lack unique identifiers. This can be viewed as a type of microclustering, with few observations per cluster and a very large number of clusters. We show that the problem is fundamentally hard from a theoretical perspective and, even in idealized cases, accurate entity resolution is effectively impossible unless the number of entities is small relative to the number of records and/or the separation between records from different entities is extremely large. These results suggest conservatism in interpretation of the results of record linkage, support collection of additional data to more accurately disambiguate the entities, and motivate a focus on coarser inference. For example, results from a simulation study suggest that sometimes one may obtain accurate results for population size estimation even when fine‐scale entity resolution is inaccurate.
