Random Partition Models for Microclustering Tasks

Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this assumption is appealing for some applications, it is not appropriate for tasks such as entity resolution, modeling of sparse networks, and DNA sequencing. Such applications require models that yield clusters whose sizes grow sublinearly with the total number of data points, a behavior known as the microclustering property. Motivated by these issues, we propose a general class of random partition models that satisfy the microclustering property and have well-characterized theoretical properties. Our proposed models overcome major limitations in the existing literature on microclustering models, namely a lack of interpretability, identifiability, and full characterization of the models' asymptotic properties. Crucially, we drop the classical assumption of an exchangeable sequence of data points and instead assume an exchangeable sequence of clusters. In addition, our framework provides flexibility in the prior distribution of cluster sizes, computational tractability, and applicability to a large number of microclustering tasks. We establish theoretical properties of the resulting class of priors, characterizing the asymptotic behavior of the number of clusters and of the proportion of clusters of a given size. Our framework admits a simple and efficient Markov chain Monte Carlo algorithm for statistical inference. We illustrate the proposed methodology on the microclustering task of entity resolution, with a simulation study and experiments on real survey panel data.
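
The contrast between linear and sublinear cluster growth can be illustrated with a small simulation. The sketch below is not the model proposed here: it uses the Chinese restaurant process as an assumed stand-in for a traditional exchangeable partition model, and a toy partition whose cluster sizes are drawn independently from a bounded distribution (with the last cluster truncated so the sizes sum to n) as an assumed stand-in for a microclustering model. The function names, the concentration parameter `alpha`, and the bound `max_size` are all hypothetical choices made for illustration.

```python
import random


def crp_cluster_sizes(n, alpha=1.0, rng=None):
    """Cluster sizes after seating n points with a Chinese restaurant process."""
    if rng is None:
        rng = random.Random(0)
    sizes = []
    for i in range(n):
        # Point i opens a new cluster with probability alpha / (i + alpha),
        # otherwise it joins an existing cluster with probability
        # proportional to that cluster's current size.
        u = rng.random() * (i + alpha)
        if u < alpha:
            sizes.append(1)
            continue
        u -= alpha
        acc = 0
        for k, s in enumerate(sizes):
            acc += s
            if u < acc:
                sizes[k] += 1
                break
        else:
            # Guard against floating-point round-off at the upper boundary.
            sizes[-1] += 1
    return sizes


def bounded_size_cluster_sizes(n, max_size=4, rng=None):
    """Toy partition: i.i.d. uniform cluster sizes in 1..max_size.

    The final cluster is truncated so that the sizes sum to n. This is a
    simplification for illustration, not the construction from the paper.
    """
    if rng is None:
        rng = random.Random(0)
    sizes = []
    remaining = n
    while remaining > 0:
        s = min(rng.randint(1, max_size), remaining)
        sizes.append(s)
        remaining -= s
    return sizes


# Compare the fraction of points in the largest cluster as n grows.
for n in (100, 1000, 10000):
    crp = crp_cluster_sizes(n, rng=random.Random(n))
    toy = bounded_size_cluster_sizes(n, rng=random.Random(n))
    print(n, max(crp) / n, max(toy) / n)
```

In the toy partition the largest cluster never exceeds `max_size`, so its share of the data shrinks toward zero as n grows, whereas under the Chinese restaurant process the largest cluster typically retains a non-negligible share of the data.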
