Confidentiality, Uniqueness, and Disclosure Limitation for Categorical Data

Suppose a population of individuals is cross-classified according to several categorical variables, yielding a cell with an entry of "1." Then we say that the individual corresponding to that "1" is unique in the population for these variables or, more succinctly, is a population unique. Note that, in principle, if we use enough variables everyone in the population may be unique. Thus we presume that the data collection agency has been somewhat careful in its choice of a set of p variables to collect, and that the total number of cells in the resulting cross-classification, K, is sufficiently less than the population size, N, to make the problem of identifying population uniques statistically interesting.

When does the existence of a population unique lead to a data disclosure problem related to a pledge of confidentiality, e.g., not to release information collected from respondents in identifiable form? If a data release displays the information for an individual unique in the population, then an intruder will know that such an individual was included in the data base. An intruder who possesses matching data about a population unique has the potential to match his or her records against those in the released data. This would lead to a formal violation of confidentiality. Further, if a subset of variables leads to uniqueness in the population, then by matching records the intruder may actually learn some additional information about the unique individual beyond that already in his or her files.

When an agency releases data on individuals that are categorical in nature, the possible identification of those who are unique or rare in the population is a concern because identity disclosure is deemed to be a violation of promises of confidentiality. We review relationships among uniqueness in a sample, uniqueness in the population, and notions of disclosure, and then turn to methods for assessing disclosure potential as a result of sample uniqueness, especially using log-linear models.
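The notion of a population unique can be made concrete with a small computation. The following Python sketch is an illustration only and is not taken from the paper: the toy population, the variable names (`sex`, `age_group`, `region`), and the helper `find_uniques` are all hypothetical. It cross-classifies the records on a chosen set of variables and returns the individuals who fall in cells with a count of 1.

```python
from collections import Counter

def find_uniques(records, variables):
    """Return the records whose combination of values on `variables`
    occurs exactly once, i.e. those sitting in a cell with an entry of 1."""
    # Each record's cell in the cross-classification is the tuple of its values.
    keys = [tuple(r[v] for v in variables) for r in records]
    cell_counts = Counter(keys)
    return [r for r, k in zip(records, keys) if cell_counts[k] == 1]

# Hypothetical toy population of N = 6 individuals (assumed data).
population = [
    {"id": 1, "sex": "F", "age_group": "30-39", "region": "North"},
    {"id": 2, "sex": "F", "age_group": "30-39", "region": "North"},
    {"id": 3, "sex": "M", "age_group": "30-39", "region": "North"},
    {"id": 4, "sex": "M", "age_group": "40-49", "region": "South"},
    {"id": 5, "sex": "M", "age_group": "40-49", "region": "South"},
    {"id": 6, "sex": "F", "age_group": "40-49", "region": "South"},
]

# Using all three variables, two individuals are population uniques.
unique_ids = [r["id"] for r in find_uniques(population, ["sex", "age_group", "region"])]
print(unique_ids)  # [3, 6]

# Using a single variable, no cell has a count of 1.
print(find_uniques(population, ["sex"]))  # []
```

In this toy example, classifying on `sex` alone produces no uniques, while adding `age_group` and `region` isolates two individuals. This mirrors the remark above that, with enough variables, everyone in the population may become unique.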