Statistical agencies are concerned about disclosure of confidential information when data are released that can be identified as referring to a small group of people. Disclosure can occur with tabular data releases if a cell corresponds to a very small group, for example, in the cross-classification of geography with a distinctive characteristic with many levels such as occupation. Similarly, it can occur in a microdata release if variables in the data can be combined with publicly available information to identify the person to whom an individual record corresponds. Again, this is particularly likely when detailed geography is combined with a characteristic like occupation, although combinations of apparently innocuous variables such as age, sex, and race may also lead to disclosure. In either case, information reported for the identified cell or microdata record (such as mean income for a cell or income for a record) can be associated with an individual or small group of individuals, violating the confidentiality of their data. Common strategies for preventing such disclosures aim to limit reporting to aggregates consisting of some minimum number of individuals, so that tabular summaries or microdata records cannot be attached to individuals or small groups of individuals. For example, cells in a table may be suppressed or combined until a fixed minimum number of cases is attained, or geographical detail may be limited to units exceeding a certain size. A nondisclosure policy for tabular data or microdata restricts release of information that could be related to a specific individual. Pannekoek and de Waal (1998) describe a rule that suppresses data release when the number of people in a cell defined by a rare characteristic falls below a fixed floor, and show how empirical Bayes methods can be used to improve the estimation of that number.
We argue that the nondisclosure problem can be formulated as a decision problem in which one loss is associated with the possibility of disclosure and another with nonpublication of data. This analysis supports a decision on whether to disclose information in each cell, minimizing the expected sum of the two losses. We present arguments for several loss functions, considering both tabular and microdata releases, and illustrate their application to simple simulated data.
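As a rough illustration of the decision formulation, the following Python sketch compares the expected disclosure loss for a cell against a fixed loss from withholding it. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the inverse-size disclosure loss, the constant nonpublication loss, and the toy posterior distributions over the cell's population count.

```python
from typing import Callable, Dict


def expected_disclosure_loss(
    posterior: Dict[int, float],
    disclosure_loss: Callable[[int], float],
) -> float:
    """Average the disclosure loss over a posterior on the cell's population count."""
    return sum(prob * disclosure_loss(count) for count, prob in posterior.items())


def should_publish(
    posterior: Dict[int, float],
    disclosure_loss: Callable[[int], float],
    nonpublication_loss: float,
) -> bool:
    """Publish when publishing has the smaller expected loss.

    Publishing incurs the expected disclosure loss; withholding incurs
    the (assumed constant) nonpublication loss.
    """
    return expected_disclosure_loss(posterior, disclosure_loss) < nonpublication_loss


def inverse_size_loss(count: int) -> float:
    """Hypothetical loss: harm shrinks as the cell's population grows."""
    return 1.0 / count


# Toy posteriors over the true population count in two cells (illustrative only).
small_cell = {1: 0.5, 2: 0.3, 3: 0.2}
large_cell = {10: 0.5, 20: 0.5}

for name, cell in [("small", small_cell), ("large", large_cell)]:
    decision = should_publish(cell, inverse_size_loss, nonpublication_loss=0.25)
    print(name, "publish" if decision else "suppress")
```

Under these assumed losses the small cell is suppressed and the large cell is published. Unlike a fixed-floor rule, the decision depends on the whole posterior for the cell count and on the relative size of the two losses.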
[1] Chris J. Skinner, et al. Estimating the re-identification risk per record in microdata, 1998.
[2] C. Skinner, et al. Disclosure control for census microdata, 1994.
[3] Malay Ghosh, et al. Small Area Estimation: An Appraisal, 1994.
[4] L. Zayatz, et al. Strategies for measuring risk in public use microdata files, 1992.
[5] W. Keller, et al. Disclosure control of microdata, 1990.
[6] George T. Duncan, et al. Disclosure-Limited Data Dissemination, 1986.
[7] S. Keller-McNulty, et al. Estimation of Identification Disclosure Risk in Microdata, 1999.
[8] S. Fienberg, et al. Confidentiality, Uniqueness, and Disclosure Limitation for Categorical Data, 1999.
[9] S. M. Samuels. A Bayesian, Species-Sampling-Inspired Approach to the Uniques Problem in Microdata Disclosure Risk Assessment, 1999.
[10] Ton de Waal, et al. Synthetic and combined estimators in statistical disclosure control, 1998.
[11] D. Lambert. Measures of Disclosure Risks and Harm, 1993.