Estimating Risks of Identification Disclosure in Microdata

When statistical agencies release microdata to the public, malicious users (intruders) may be able to link records in the released data to records in external databases. Releasing data in ways that fail to prevent such identifications may discredit the agency or, for some data, constitute a breach of law. To limit disclosures, agencies often release altered versions of the data; however, there usually remain risks of identification. This article applies and extends the framework developed by Duncan and Lambert for computing probabilities of identification for sampled units. It describes methods tailored specifically to data altered by recoding and topcoding variables, data swapping, or adding random noise (and combinations of these common data alteration techniques) that agencies can use to assess threats from intruders who possess information on relationships among variables and the methods of data alteration. Using data from the Current Population Survey, the article illustrates a step-by-step process for evaluating identification disclosure risks for competing releases under varying assumptions of intruders' knowledge. Risk measures are presented for individual units and for entire datasets.
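As a rough illustration of what a probability of identification can look like for data altered by adding random noise, the sketch below computes, for one target unit, the probability that each released record is the match. It is not the article's procedure. It assumes the target is known to be in the released file, a uniform prior over records, independent Gaussian noise with standard deviations known to the intruder, and made-up values; the function name `match_probabilities` is hypothetical.

```python
import math

def match_probabilities(target, released, noise_sd):
    """Probability that each released record belongs to the target unit.

    Assumptions (illustrative only): the target is in the released file,
    every record is equally likely a priori, and each released value equals
    the true value plus independent Gaussian noise with known std. deviation.
    """
    log_weights = []
    for record in released:
        # Gaussian log-likelihood of the released values if this record's
        # true values equal the target's (constants common to all records
        # cancel after normalization and are omitted).
        ll = sum(-0.5 * ((r - t) / sd) ** 2
                 for r, t, sd in zip(record, target, noise_sd))
        log_weights.append(ll)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical released values (age, income in $1000s) after noise addition.
released = [(34.2, 51.0), (35.1, 78.5), (61.7, 49.3)]
# Values the intruder holds for the target unit from an external database.
target = (35.0, 80.0)

probs = match_probabilities(target, released, noise_sd=(1.0, 2.0))
best = max(range(len(probs)), key=probs.__getitem__)
print(best, [round(p, 4) for p in probs])
```

A per-record risk measure could then be the largest of these probabilities for each target, and a file-level measure some summary of those maxima; both choices are assumptions made here for illustration.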

[1]  Ivan P. Fellegi,et al.  A Theory for Record Linkage , 1969 .

[2]  I. Fellegi.  Report on Statistical Disclosure and Disclosure-Avoidance Techniques (Statistical Policy Working Paper 2) , 1979 .

[3]  S. Reiss,et al.  Data-swapping: A technique for disclosure control , 1982 .

[4]  George T. Duncan,et al.  Disclosure-Limited Data Dissemination , 1986 .

[5]  G. Paass.  Disclosure Risk and Disclosure Avoidance for Microdata , 1988 .

[6]  D. Lambert,et al.  The Risk of Disclosure for Microdata , 1989 .

[7]  W. Keller,et al.  Disclosure control of microdata , 1990 .

[8]  Uwe Blien,et al.  Disclosure risk for microdata stemming from official statistics , 1992 .

[9]  C. J. Skinner,et al.  On identification disclosure and prediction disclosure for microdata , 1992 .

[10]  L. Zayatz,et al.  Strategies for measuring risk in public use microdata files , 1992 .

[11]  C. Skinner,et al.  Disclosure control for census microdata , 1994 .

[12]  S. Fienberg,et al.  A Bayesian Approach to Data Disclosure: Optimal Intruder Behavior for Continuous Data , 1997 .

[13]  S. Keller-McNulty,et al.  Estimation of Identification Disclosure Risk in Microdata , 1999 .

[14]  S. M. Samuels.  A Bayesian, Species-Sampling-Inspired Approach to the Uniques Problem in Microdata Disclosure Risk Assessment , 1999 .

[15]  Jeroen Pannekoek.  Statistical methods for some simple disclosure limitation rules , 1999 .

[16]  L. Willenborg,et al.  Elements of Statistical Disclosure Control , 2000 .

[17]  U. Rovira,et al.  Chapter 6 A Quantitative Comparison of Disclosure Control Methods for Microdata , 2001 .

[18]  A. Dale,et al.  Proposals for 2001 samples of anonymized records: An assessment of disclosure risk , 2001 .

[19]  C. Skinner,et al.  A measure of disclosure risk for microdata , 2002 .

[20]  Mark Elliot,et al.  Disclosure Risk Assessment , 2002 .

[21]  Stephen E. Fienberg,et al.  Modelling User Uncertainty for Disclosure Risk and Data Utility , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst.

[22]  Nancy L. Spruill.  Measures of Confidentiality , 2002 .

[23]  Josep Domingo-Ferrer,et al.  Inference Control in Statistical Databases , 2002, Lecture Notes in Computer Science.

[24]  William E. Winkler,et al.  Disclosure Risk Assessment in Perturbative Microdata Protection , 2002, Inference Control in Statistical Databases.

[25]  George T. Duncan,et al.  Disclosure Risk vs. Data Utility: The R-U Confidentiality Map , 2003 .

[26]  P. Ivax,et al.  A THEORY FOR RECORD LINKAGE , 2004 .

[27]  Katherine K. Wallman,et al.  Implementing the Confidential Information Protection and Statistical Efficiency Act of 2002 , 2004 .

[28]  Jerome P. Reiter,et al.  Releasing multiply imputed, synthetic public use microdata: an illustration and empirical study , 2005 .

[29]  Jerome P. Reiter,et al.  Using CART to generate partially synthetic public use microdata , 2005 .

[30]  Chris J. Skinner,et al.  The probability of identification: applying ideas from forensic statistics to disclosure risk assessment , 2007 .