Single-Ranking Micro-aggregation and Re-identification
This paper shows that it is possible to create metrics for which re-identification is straightforward in situations where continuous variables have been micro-aggregated one at a time using conventional methods.

Introduction

The purpose of this document is to provide an overall justification of how micro-aggregation can yield public-use files in which re-identification rates are extraordinarily high. This work provides intuition and reasoning that supplement empirical evidence suggesting that re-identification rates may be high with standard micro-aggregation (see, e.g., Domingo-Ferrer and Mateo-Sanz 2001, Domingo-Ferrer et al. 2002). Much of the earlier micro-aggregation work (Domingo-Ferrer and Mateo-Sanz 2001, 2002; Defays and Anwar 1998) concentrated on the degradation of analytic properties (such as regression) and typically did not contain re-identification experiments.

The general rule is that the data values associated with individual variables in records are put in groups of approximately size k, where k is typically between 3 and 10. The original values are replaced with micro-aggregates, typically the averages within groups. As observed by Domingo et al. and others, as k increases toward 10, the analytic properties (regressions, etc.) can deteriorate severely. To reduce the deterioration of the analytic utility of the micro-aggregated data, the papers have generally taken k=3 or k=4. Although re-identification experiments were typically not done, the assumption was that re-identification would be more difficult as k increases. That understanding, however, was based on using variables in isolation from one another (i.e., single-ranking micro-aggregation). It did not consider looking at combinations of variables, as might be done with nearest-neighbor matching (exceptions being the recent work of Domingo et al.
2001).

Domingo-Ferrer and Mateo-Sanz (2002) have shown that it is possible to micro-aggregate on several variables at once. Although the procedures are more difficult theoretically and computationally, they provide lower re-identification rates at the same values of k than the single-variable aggregation methods. Multi-variable aggregation can, however, cause more severe deterioration of analytic properties. We do not consider multi-variable aggregation in this paper.

Basic Situation: Identifying a Micro-aggregated Database Against the Original Database

We consider a rectangular database (table) having fields (variables) $X_i$, $i = 1, \ldots, n$, and value states $x_{ij}$, $j = 1, \ldots, n_i$. In many microdata confidentiality experiments, users want 10 or more variables $X_i$. We assume that each of the variables $X_i$ is continuous, skewed, and takes no zero value states. The second assumption eliminates a few additional technical details and can easily be dropped. The third assumption is for convenience and is not generally needed for the arguments that follow.

We begin our discussion by considering databases with 1000 or more records and situations in which micro-aggregation is on one variable at a time. Although sampling may reduce re-identification rates in some situations, it can also cause severe additional deterioration in analytic properties. We do not consider the deterioration of analytic properties due to a combination of micro-aggregation and sampling.

In this discussion, we demonstrate that micro-aggregation as currently practiced allows almost perfect re-identification with existing record linkage procedures even when k is greater than or equal to 10. We can easily develop nearest-neighbor methods with similar metrics that have almost 100 percent re-identification rates. We choose any three variables, say $X_1$, $X_2$, and $X_3$, that are pairwise uncorrelated ($R^2 \le 0.2$).
Our procedure aggregates variables one at a time. Within each variable, sort the values and aggregate into groups of size 3 or more. Let the new micro-aggregated value states be denoted by $a(x_{ij}) = y_{ij}$, $j = 1, \ldots, k_i$, $i = 1, 2, 3$, where $a(\cdot)$ is the aggregation function. Each aggregated value state is assumed three or more times (that is, 3 or more records have the same value of the y-variable). Most aggregates will be formed from three value states only. In the following, $y_{i j_i}$ denotes the $j_i$-th value state of micro-aggregated variable $Y_i$. The micro-aggregated value $y_{i j_i}$ will be a value such as the average or median, and so lies within the range of the values being micro-aggregated.

We develop new record linkage metrics (or nearest-neighbor metrics) as follows. The metrics are for matching a micro-aggregated record R with the original set of data records. Let $R = (y_{1 j_1}, y_{2 j_2}, y_{3 j_3}) = (a(x_{1 k_1}), a(x_{2 k_2}), a(x_{3 k_3}))$, where the $y_i$'s are values aggregated by the aggregation operator $a(\cdot)$ from the original values $x_i$'s. Using the sort ordering for individual variables, for each i, let $p(y_{i j_i})$ be the predecessor of $y_{i j_i}$ and $s(y_{i j_i})$ be the successor of $y_{i j_i}$. In each situation, the predecessor and the successor are distinct from the value $y_{i j_i}$. For $y_{i j_i}$, let the metric be

$$\mathrm{dist}(x, y_{i j_i}) = \begin{cases} 1 & \text{if } |x - y_{i j_i}| \le \tfrac{1}{2}\min\bigl(|y_{i j_i} - p(y_{i j_i})|,\ |y_{i j_i} - s(y_{i j_i})|\bigr), \\ 0 & \text{otherwise.} \end{cases}$$

This allows us to match the X-variables in the original file with the Y-values in the micro-aggregated file. Suitable adjustments should be made at the ends of the distributions (i.e., the metric becomes one-sided).

Let N be the number of records in the original database. Then micro-aggregated record R has probability close to one of matching with its true corresponding original record. The probability is at least $(N-3)/N$ on each field. On each field, R has probability close to zero of matching with any record other than its original corresponding record. We repeat the above argument.
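The aggregation step and the matching metric described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the software used for the paper: the function names (`microaggregate`, `match_radius`, `candidates`, `reidentify`), the choice of the mean as the aggregate, the folding of a short final group into the previous one, the reading of the metric as half the smaller gap to a neighbouring aggregate, and the toy data are all introduced here for the example.

```python
from bisect import bisect_left


def microaggregate(values, k=3):
    """Single-variable micro-aggregation: sort, cut into groups of k, use the group mean."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    start = 0
    while start < n:
        end = start + k
        if n - end < k:  # assumption: a tail shorter than k joins the last group
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
        start = end
    return out


def match_radius(y, levels):
    """Half the gap from y to its nearest neighbouring aggregate (one-sided at the ends)."""
    pos = bisect_left(levels, y)  # levels: sorted distinct aggregated values
    gaps = []
    if pos > 0:
        gaps.append(y - levels[pos - 1])      # distance to the predecessor p(y)
    if pos < len(levels) - 1:
        gaps.append(levels[pos + 1] - y)      # distance to the successor s(y)
    return min(gaps) / 2 if gaps else float("inf")


def candidates(y, originals, levels):
    """Indices of original values x for which dist(x, y) = 1."""
    radius = match_radius(y, levels)
    return {i for i, x in enumerate(originals) if abs(x - y) <= radius}


def reidentify(record, original_columns, level_columns):
    """Intersect the per-field candidate sets for one micro-aggregated record."""
    survivors = None
    for y, xs, levels in zip(record, original_columns, level_columns):
        found = candidates(y, xs, levels)
        survivors = found if survivors is None else survivors & found
    return sorted(survivors or [])


# Toy data (hypothetical): two variables whose sort orders are unrelated.
x1 = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]
x2 = [5.0, 50.0, 500.0, 6.0, 51.0, 501.0, 7.0, 52.0, 502.0]
y1, y2 = microaggregate(x1), microaggregate(x2)
levels1, levels2 = sorted(set(y1)), sorted(set(y2))
matches = [
    reidentify((y1[i], y2[i]), (x1, x2), (levels1, levels2))
    for i in range(len(x1))
]
print(matches)
```

In this toy data each field alone leaves three candidates per record, and intersecting the two fields leaves exactly one, the true original record. Real data with less cleanly separated groups may leave some original values outside the radius of their own aggregate, so the sketch illustrates the mechanism rather than a guaranteed rate.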
If micro-aggregated record R is matched against the original data using only variable $X_1$, then it can be matched against at most three records, and the correct match is among those three. Matching on variable $X_1$ thus quickly eliminates N-3 records from consideration. If we now match on variable $X_2$, there is a virtual certainty that we can identify the single record (of three) that R correctly matches. The intuition is that if record R matches on the first variable, then there are at most three records in the original data meeting that criterion (one of which is correct). The same thing happens on the second field, and again on the third. Typically, after two variables are compared, record R can be correctly matched.

If k is increased from 3 to 10, it is still very straightforward to create new optimized metrics, and re-identification rates are still likely to be 100%. Programming the new metrics is exceptionally straightforward: one sorts on a variable, aggregates, and computes the new metric. The new metric is highly optimized for the given data and micro-aggregation procedure. Adapting general matching (re-identification) software is also exceptionally straightforward.

First Extension: Identifying a 1% Sample of Micro-aggregated Data Against the Original Database

In this extension, we begin with a database D of 100,000 records having ten continuous variables. Again, for convenience, we assume that each of the variables $X_i$ is continuous, skewed, and takes no zero value states. We aggregate in groups of approximately size k=3. We create a sample S containing 1% of the records. Again, we choose any three variables, say $X_1$, $X_2$, and $X_3$, that are pairwise uncorrelated ($R^2 \le 0.2$). Let $R = (y_{1 j_1}, y_{2 j_2}, y_{3 j_3}) = (a(x_{1 k_1}), a(x_{2 k_2}), a(x_{3 k_3}))$, where the $y_i$'s are values aggregated by the aggregation operator $a(\cdot)$ from the original values $x_i$'s. At this point, we use intuition from the first, much easier example. Pair record R with the approximately nine closest records in D.
The pairing is according to the distance between the $x_{1 k_1}$ values and $y_{1 j_1}$. Again, at least one of these nine will be the correct match. Within the nine, compare the $x_{2 k_2}$ values with $y_{2 j_2}$ to determine the plausible correct match. If the value $y_{2 j_2}$ is not sufficient, use the remaining value $y_{3 j_3}$. Within three iterations (i.e., use of three variables), the correct match will be obtained. Repeat for all micro-aggregated records R until 100% of the micro-aggregated records have been correctly matched to their corresponding records in the population file D.

Second Extension: Identifying a 1% Sample of Micro-aggregated Data Against a Corresponding Database

By a corresponding database, we mean a database D’ that corresponds to D and is available to the intruder. We assume that it also contains 10 variables and that identifying information such as name is available in D’. If we can match a record in D’ against a record in the micro-aggregated sample S, then a re-identification occurs. We assume that at most three variables in each record in D’ have values that deviate by 30% from their corresponding values in D, and that the remaining variables deviate by at most 1-3% from the corresponding values in D. We consider restrictions similar to the previous two examples. We create a sample S containing 1% of the records. This time we use all ten variables, and we use only some of the ideas from the previous example.

Let $R = (y_{1 j_1}, y_{2 j_2}, \ldots, y_{10 j_{10}}) = (a(x_{1 k_1}), a(x_{2 k_2}), \ldots, a(x_{10 k_{10}}))$, where the $y_i$'s are values aggregated by the aggregation operator $a(\cdot)$ from the original values $x_i$'s. For each variable $X_i$, $i = 1, \ldots, 10$, we sequentially match record R as follows. Choose a group $G_i$ of 360 records that agree most closely with $y_{i j_i}$. Let r’ in D’ be the record that matches R most closely in seven of the ten fields. By our previous reasoning, there will be a unique record in D’ that agrees with R.
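One possible reading of the group-and-vote step above is sketched below in Python. It is an interpretation, not the author's procedure: the sketch assumes that each per-variable group holds the m records of D’ closest to R on that variable, and that r’ is the record appearing in the most groups, accepted only if it appears in at least `min_agree` of them and strictly more often than any other record. The names `closest_group` and `vote_match` and the toy data (five fields, m = 2, three agreements required) are hypothetical.

```python
from collections import Counter


def closest_group(target, column, m):
    """Indices of the m records whose value in `column` is closest to `target`."""
    return sorted(range(len(column)), key=lambda i: abs(column[i] - target))[:m]


def vote_match(record, columns, m, min_agree):
    """Return the index of the record that falls in the most per-field groups,
    or None if it agrees on fewer than `min_agree` fields or is tied."""
    votes = Counter()
    for y, column in zip(record, columns):
        votes.update(closest_group(y, column, m))
    ranked = votes.most_common(2)
    if not ranked:
        return None
    best, best_votes = ranked[0]
    runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
    if best_votes >= min_agree and best_votes > runner_up_votes:
        return best
    return None


# Toy stand-in for D' (hypothetical): six records, five fields, stored by column.
# Record 2 is the true match; its last two fields deviate strongly.
columns_d_prime = [
    [34.0, 60.0, 30.0, 10.0, 50.0, 80.0],
    [5.0, 90.0, 25.0, 70.0, 28.0, 50.0],
    [100.0, 900.0, 200.0, 700.0, 500.0, 230.0],
    [1.0, 3.5, 9.0, 2.8, 6.0, 7.0],
    [20.0, 9.0, 1.0, 15.0, 30.0, 9.5],
]
record_r = (31.0, 24.0, 205.0, 3.1, 9.2)  # micro-aggregated values for record 2

print(vote_match(record_r, columns_d_prime, m=2, min_agree=3))
```

Under this reading, the setting in the text would correspond to ten columns with `m=360` and `min_agree=7`. The tie check is a design choice made for the sketch so that an ambiguous vote is reported as no match rather than resolved arbitrarily.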
Although record R will not agree with r’ in D’ on three fields, we can still find it. The redundanc
[1] D. Defays, et al. Masking Microdata Using Micro-Aggregation, 1999.
[2] Josep Domingo-Ferrer, et al. Practical Data-Oriented Microaggregation for Statistical Disclosure Control, 2002, IEEE Trans. Knowl. Data Eng.
[3] Josep Domingo-Ferrer, et al. On the Security of Microaggregation with Individual Ranking: Analytical Attacks, 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst.