This paper discusses aspects of the individual risk methodology initially proposed by Benedetti and Franconi (1998). The original formulation defines a record-level measure of re-identification risk, called the individual risk, that can be estimated by exploiting information on the sampling design. The methodology is currently implemented in the testing version of the software μ-Argus, developed under the European project CASC. In social surveys, the context to which the individual risk methodology is best suited, it is reasonable to hypothesise that identification, which consists of linking a sample unit to a population unit, is performed on the basis of a set of known categorical identifying variables, or key variables. This implies that the individual risk depends on the joint distribution of the key variables, in particular on the sizes f_k and F_k of the subgroups of units sharing a given combination of key variables in the sample and in the population, respectively. Unlike approaches that propose a record-level measure of risk based on the concept of sample uniques (e.g. Skinner and Elliot, 2002), the risk is defined for every record in the sample. The measure also differs from those based on the sample frequency of combinations, because inference is performed on the sizes F_k of the population subgroups. The method shares the inferential nature of the former strategies and the approach to protection of the latter. Indeed, once the individual risk has been estimated for each record in the sample, protection is ensured by applying local suppression to high-risk individuals only. The paper discusses the formalisation of the individual risk function for files of independent units. Once the disclosure scenario has been defined, the individual risk measure is linked to the probability of re-identification of a single record, given information on a set of key variables observed on the whole population.
Based on this connection, an overall measure of risk, called the re-identification rate, is proposed. Although this is a file-level measure like the ones discussed by Skinner and Elliot (2002), it exploits the probability of re-identification of each sampled record. In particular, it is defined in terms of the expected number of re-identifications in the file to be released. Whenever the individual risk methodology is used to protect a sample by local suppression, the user is asked to select a risk threshold that classifies individuals as safe or unsafe. The paper investigates how the re-identification rate may be exploited to select a proper risk threshold, using a target measure of “safety” for the whole file.
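As a rough illustration of the quantities described in this abstract, the following Python sketch counts the sample frequencies f_k, averages per-record risks into a file-level rate, and picks a risk threshold for a target rate. Everything in it is an assumption made for illustration: the function names, the toy records, the population sizes F_hat, and the placeholder risk 1/F_hat are hypothetical and do not reproduce the estimator of the paper, which performs inference on F_k using the sampling design. The rate is taken here as the expected number of re-identifications divided by the number of records, and the threshold rule (keep as safe the records whose summed risk stays within the target) is one plausible reading of the text, not the authors' procedure.

```python
from collections import Counter


def sample_frequencies(records, key_vars):
    """Return f_k: how many sample records share each key-variable combination."""
    return Counter(tuple(rec[v] for v in key_vars) for rec in records)


def reidentification_rate(risks):
    """Expected number of re-identifications divided by the number of records."""
    return sum(risks) / len(risks)


def choose_threshold(risks, target_rate):
    """Largest threshold t such that the records with risk < t (the safe ones)
    contribute at most target_rate to the file-level rate.
    Returns None when every record can be classified as safe."""
    n = len(risks)
    running = 0.0
    for r in sorted(risks):
        running += r
        if running / n > target_rate:
            return r  # this record and all riskier ones are unsafe
    return None


# Toy sample with two key variables (hypothetical data).
key_vars = ("sex", "region")
records = (
    [{"sex": "F", "region": "N"}] * 4
    + [{"sex": "M", "region": "N"}] * 2
    + [{"sex": "F", "region": "S"}]
    + [{"sex": "M", "region": "S"}]
)
f = sample_frequencies(records, key_vars)

# Hypothetical population subgroup sizes; in practice F_k is unknown
# and has to be inferred from the sample and the sampling design.
F_hat = {("F", "N"): 8, ("M", "N"): 4, ("F", "S"): 2, ("M", "S"): 2}

# Placeholder per-record risk 1 / F_hat (an assumption, not the paper's estimator).
risks = [1.0 / F_hat[tuple(rec[v] for v in key_vars)] for rec in records]

rate = reidentification_rate(risks)
threshold = choose_threshold(risks, target_rate=0.1)
print(f"rate = {rate}, threshold = {threshold}")
```

Under this rule, records with risk at or above the returned threshold would be treated as unsafe and become candidates for local suppression, while the remaining records are released as they are.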
[1] M. Trottini. A Decision-Theoretic Approach to Data Disclosure Problems, 2001.
[2] L. Franconi et al. Joint ECE/Eurostat Work Session on Statistical Data Confidentiality. United Nations Statistical Commission and Economic Commission for Europe, Conference of European Statisticians; European Commission, Statistical Office of the European Communities (Eurostat), 2003.
[3] Luisa Franconi et al. Statistical and Technological Solutions for Controlled Data Dissemination, 1998.
[4] Carl-Erik Särndal et al. Model Assisted Survey Sampling, 1997.
[5] Adrian Dobra et al. Assessing the Risk of Disclosure of Confidential Categorical Data, 2002.
[6] C. Skinner et al. A measure of disclosure risk for microdata, 2002.