Distribution-Preserving Statistical Disclosure Limitation

One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These contain records for actual respondents, with confidential values replaced by multiply-imputed synthetic values. A misspecified imputation model can invalidate inferences, because the distribution of the synthetic data is completely determined by the model used to generate them. We present two practical methods of generating synthetic values when the imputer has only limited information about the true data-generating process. The first is applicable when the true likelihood is known up to a monotone transformation. The second requires only limited knowledge of the true likelihood, but nevertheless preserves the conditional distribution of the confidential data, up to sampling error, on arbitrary subdomains. Our method maximizes data utility and minimizes incremental disclosure risk, up to posterior uncertainty in the imputation model and sampling error in the estimated transformation. We validate the approach with a simulation and an application to a large linked employer-employee database.
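The idea of imputing on a transformed scale and mapping the draws back can be illustrated with a minimal sketch. It is not the estimator developed in the paper: it assumes a rank-based normal-score transform as a stand-in for the estimated monotone transformation, a simple linear regression on one hypothetical non-confidential covariate `x`, and fixed parameter estimates where proper multiple imputation would draw the parameters from their posterior. All function names and the simulated data are illustrative.

```python
import math
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def to_normal_scores(values):
    """Monotone transform: map each value to Phi^{-1}((rank + 0.5) / n)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores

def from_normal_scores(z, sorted_values):
    """Inverse transform: interpolate between order statistics of the confidential data."""
    n = len(sorted_values)
    pos = STD_NORMAL.cdf(z) * n - 0.5        # fractional rank implied by z
    pos = min(max(pos, 0.0), n - 1.0)        # clamp to the observed range
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_values[lo] + frac * (sorted_values[hi] - sorted_values[lo])

def fit_line(x, z):
    """Least-squares fit of z on x; returns intercept, slope, residual std. dev."""
    xbar, zbar = mean(x), mean(z)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z)) / sxx
    intercept = zbar - slope * xbar
    rss = sum((zi - intercept - slope * xi) ** 2 for xi, zi in zip(x, z))
    return intercept, slope, math.sqrt(rss / (len(x) - 2))

def synthesize(x, y, m=3, seed=0):
    """Return m synthetic copies (implicates) of the confidential variable y."""
    rng = random.Random(seed)
    z = to_normal_scores(y)
    intercept, slope, sigma = fit_line(x, z)  # simplification: no posterior draw
    y_sorted = sorted(y)
    implicates = []
    for _ in range(m):
        z_star = [intercept + slope * xi + rng.gauss(0.0, sigma) for xi in x]
        implicates.append([from_normal_scores(zs, y_sorted) for zs in z_star])
    return implicates

# Simulated example: a skewed (lognormal) confidential variable.
_rng = random.Random(42)
x = [_rng.uniform(0.0, 4.0) for _ in range(500)]
y = [math.exp(1.0 + 0.5 * xi + _rng.gauss(0.0, 0.6)) for xi in x]
implicates = synthesize(x, y, m=3, seed=1)
```

Because the synthetic draws are mapped back through the empirical quantiles of the confidential variable, each implicate stays within the observed range and roughly inherits its skewed marginal shape, even though the regression itself is fitted on the normal scale.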
