Estimating Identification Disclosure Risk Using Mixed Membership Models

Statistical agencies and other organizations that disseminate data are obligated to protect data subjects’ confidentiality. For example, ill-intentioned individuals might link data subjects to records in other databases by matching on common characteristics (keys). Successful links are particularly problematic for data subjects with combinations of keys that are unique in the population. Hence, as part of their assessments of disclosure risks, many data stewards estimate the probabilities that sample uniques on sets of discrete keys are also population uniques on those keys. This is typically done using log-linear modeling on the keys. However, log-linear models can yield biased estimates of cell probabilities for sparse contingency tables with many zero counts, a situation that often arises in databases with many keys. This bias can result in unreliable estimates of probabilities of uniqueness and, hence, misrepresentations of disclosure risks. We propose an alternative to log-linear models for datasets with sparse keys based on a Bayesian version of grade of membership (GoM) models. We present a Bayesian GoM model for multinomial variables and offer a Markov chain Monte Carlo algorithm for fitting the model. We evaluate the approach by treating data from a recent U.S. Census Bureau public use microdata sample as a population, taking simple random samples from that population, and benchmarking estimated probabilities of uniqueness against population values. Compared to log-linear models, GoM models provide more accurate estimates of the total number of uniques in the samples. Additionally, they offer record-level predictions of uniqueness that dominate those based on log-linear models. This article has online supplementary materials.
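To make the benchmark quantity concrete, the following minimal Python sketch counts sample uniques and checks how many of them are also population uniques when the full population is known. It illustrates only the target of estimation, not the GoM or log-linear models themselves. The function name `count_uniques`, the toy population, and the sample size are hypothetical choices made for illustration and do not come from the article.

```python
import random
from collections import Counter


def count_uniques(population, sample_indices):
    """Return (number of sample uniques, number of those that are population uniques).

    `population` is a list of hashable key combinations, one per record.
    `sample_indices` lists the positions of the sampled records.
    """
    pop_counts = Counter(population)
    sample_counts = Counter(population[i] for i in sample_indices)
    # A sample unique is a key combination held by exactly one sampled record.
    sample_uniques = [key for key, c in sample_counts.items() if c == 1]
    # It is also a population unique if only one record in the population has it.
    also_pop_unique = sum(1 for key in sample_uniques if pop_counts[key] == 1)
    return len(sample_uniques), also_pop_unique


# Hypothetical toy population: each record is a tuple of three discrete keys.
rng = random.Random(0)
population = [
    (rng.randrange(5), rng.randrange(4), rng.randrange(3)) for _ in range(200)
]
# Simple random sample without replacement (size 40 is an arbitrary choice).
sample_indices = rng.sample(range(len(population)), 40)
n_sample_unique, n_pop_unique = count_uniques(population, sample_indices)
print(n_sample_unique, n_pop_unique)
```

In practice the population counts are unknown to the data steward, which is why a model is needed to estimate the probability that each sample unique is also a population unique.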
