General Discrete-data Modeling Methods for Producing Synthetic Data with Reduced Re-identification Risk that Preserve Analytic Properties

General modeling methods for representing and improving the quality of discrete data (Winkler 2003, 2008) extend and connect the editing methods of Fellegi and Holt (1976) and the imputation ideas of Little and Rubin (2002). This paper describes a modeling framework for producing synthetic microdata that correspond more closely to external benchmark constraints on certain aggregates (such as margins) and in which certain cell probabilities are bounded both below and above to reduce re-identification risk. Rather than linear constraints (Meng and Rubin 1993), the modeling methods use convex constraints (Winkler 1990, 1993) in an extended MCECM procedure. Although the produced microdata are not epsilon-private (Dwork 2006, Dwork and Yekhanin 2008), surrogate original microdata would be exceptionally difficult (or impossible) to construct using the standard linear programming (LP) procedures of epsilon-privacy.
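To make the two kinds of constraint concrete, the following is a minimal sketch, not the extended MCECM procedure of the paper. It alternates ordinary iterative proportional fitting to benchmark margins with a clipping step that keeps every cell probability inside a lower and an upper bound, and then draws synthetic records from the fitted table. The function names (`fit_bounded_table`, `draw_synthetic`), the 2x2 table, the margins, and the bounds are all hypothetical and chosen only for illustration; the alternating heuristic is an assumption and is not guaranteed to converge in general.

```python
import random


def fit_bounded_table(counts, row_targets, col_targets, lower, upper, sweeps=200):
    """Heuristic sketch: IPF to margins, then clip cells to [lower, upper].

    Not the paper's MCECM procedure; an illustrative stand-in only.
    """
    total = float(sum(sum(row) for row in counts))
    p = [[c / total for c in row] for row in counts]  # start from observed proportions
    n_rows, n_cols = len(p), len(p[0])
    for _ in range(sweeps):
        # Scale each row so it matches its benchmark margin.
        for i in range(n_rows):
            s = sum(p[i])
            if s > 0:
                p[i] = [v * row_targets[i] / s for v in p[i]]
        # Scale each column so it matches its benchmark margin.
        for j in range(n_cols):
            s = sum(p[i][j] for i in range(n_rows))
            if s > 0:
                for i in range(n_rows):
                    p[i][j] *= col_targets[j] / s
        # Bound every cell probability below and above.
        p = [[min(max(v, lower), upper) for v in row] for row in p]
    return p


def draw_synthetic(p, n, seed=0):
    """Draw n synthetic (row, column) records with probabilities proportional to p."""
    cells = [(i, j) for i in range(len(p)) for j in range(len(p[0]))]
    weights = [p[i][j] for i, j in cells]
    return random.Random(seed).choices(cells, weights=weights, k=n)


# Hypothetical observed counts, benchmark margins, and cell bounds.
observed = [[45, 5], [20, 30]]
fitted = fit_bounded_table(observed, [0.5, 0.5], [0.6, 0.4], lower=0.1, upper=0.4)
records = draw_synthetic(fitted, n=1000)
```

In this small example the unconstrained fit would push the first cell above 0.4, so the upper bound binds and the sweeps settle at approximately `[[0.4, 0.1], [0.2, 0.3]]`, which satisfies both the margins and the bounds. Because clipping is the last step of each sweep, the bounds hold exactly while the margins hold only up to numerical tolerance.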

[1] D. Holt, et al. A Systematic Approach to Automatic Edit and Imputation, 1976.

[2] P. Holland, et al. Discrete Multivariate Analysis, 1976.

[3] D. Rubin, et al. Statistical Analysis with Missing Data, 1989.

[4] William E. Winkler. On Dykstra's Iterative Fitting Procedure, 1990.

[5] A. Agresti. An Introduction to Categorical Data Analysis, 1997.

[6] W. Winkler. Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage, 1993.

[7] W. Winkler, et al. Masking Microdata Files, 1995.

[8] Karen A. F. Copeland. An Introduction to Categorical Data Analysis, 1997.

[9] William E. Winkler, et al. Re-identification Methods for Evaluating the Confidentiality of Analytically Valid Microdata, 1998.

[10] Chuanhai Liu. Estimation of Discrete Distributions with a Class of Simplex Constraints, 2000.

[11] Cynthia Dwork, et al. Differential Privacy, 2006, ICALP.

[12] Marcello D'Orazio, et al. Statistical Matching for Categorical Data: Displaying Uncertainty and Using Logical Constraints, 2006.

[13] Cynthia Dwork, et al. Calibrating Noise to Sensitivity in Private Data Analysis, 2006, TCC.

[14] Cynthia Dwork, et al. The Price of Privacy and the Limits of LP Decoding, 2007, STOC '07.

[15] Cynthia Dwork, et al. Privacy, Accuracy, and Consistency Too: A Holistic Solution to Contingency Table Release, 2007, PODS.

[16] Adam D. Smith, et al. Composition Attacks and Auxiliary Information in Data Privacy, 2008, KDD.

[17] Yufei Tao, et al. Output Perturbation with Query Relaxation, 2008, Proc. VLDB Endow.

[18] William E. Winkler, et al. General Methods and Algorithms for Modeling and Imputing Discrete Data under a Variety of Constraints, 2008.

[19] Cynthia Dwork, et al. Differential Privacy: A Survey of Results, 2008, TAMC.

[20] Ashwin Machanavajjhala, et al. Privacy: Theory Meets Practice on the Map, 2008, 2008 IEEE 24th International Conference on Data Engineering.

[21] Lars Vilhuber, et al. How Protective Are Synthetic Data?, 2008, Privacy in Statistical Databases.

[22] Cynthia Dwork, et al. New Efficient Attacks on Statistical Disclosure Control Mechanisms, 2008, CRYPTO.