This article considers the assessment of the risk of identification of respondents in survey microdata, in the context of applications at the United Kingdom (UK) Office for National Statistics (ONS). The threat comes from the matching of categorical “key” variables between microdata records and external data sources and from the use of log-linear models to facilitate matching. While the potential use of such statistical models is well established in the literature, little consideration has been given to model specification or to the sensitivity of risk assessment to this specification. In numerical work not reported here, we have found that standard techniques for selecting log-linear models, such as chi-squared goodness-of-fit tests, provide little guidance regarding the accuracy of risk estimation for the very sparse tables generated by typical applications at ONS, for example, tables with millions of cells formed by cross-classifying six key variables, with sample sizes of 10,000 or 100,000. In this article we develop new criteria for assessing the specification of a log-linear model in relation to the accuracy of risk estimates. We find that, within a class of “reasonable” models, risk estimates tend to decrease as the complexity of the model increases. We develop criteria that detect “underfitting” (associated with overestimation of the risk). The criteria may also reveal “overfitting” (associated with underestimation), although less clearly, so we suggest employing a forward model selection approach. Our criteria turn out to be related to established methods of testing for overdispersion in Poisson log-linear models. We show how our approach may be used for both file-level and record-level measures of risk. We evaluate the proposed procedures using samples drawn from the 2001 UK Census, where the true risks can be determined, and show that a forward selection approach leads to good risk estimates.
There are several “good” models between which our approach provides little discrimination. The risk estimates are found to be stable across these models, implying a form of robustness. We also apply our approach to a large survey dataset. There is no indication that increasing the sample size necessarily leads to the selection of a more complex model. The risk estimates for this application display more variation but suggest a suitable upper bound.
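To make the idea of model-based risk estimation concrete, the following is a minimal sketch and not the procedure evaluated in the article. It assumes (these assumptions are not stated in the abstract) that the population count in each cell of the key-variable table is Poisson with mean λ, that records are sampled independently with a known sampling fraction `pi`, and that the simplest log-linear model, independence between two key variables, is fitted to the sample counts. Under those assumptions, a sample-unique record is population unique with probability exp(-λ(1 - π)), and the expected reciprocal of its population cell count is (1 - exp(-λ(1 - π))) / (λ(1 - π)). Summing each quantity over the sample-unique cells gives two file-level measures. The function names `independence_fit` and `risk_estimates` are hypothetical.

```python
import math


def independence_fit(table):
    """Fitted sample means mu_ij = row_i * col_j / n under the
    independence log-linear model for a two-way table of sample counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_tot] for r in row_tot]


def risk_estimates(table, pi):
    """Return (tau1, tau2) summed over sample-unique cells (count == 1).

    tau1: estimated number of sample uniques that are population unique.
    tau2: estimated sum of 1 / (population cell count) over sample uniques.
    Assumes Poisson population counts and a sampling fraction 0 < pi <= 1.
    """
    mu = independence_fit(table)
    tau1 = 0.0
    tau2 = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            if f != 1:
                continue  # only sample-unique cells contribute
            # Expected number of non-sampled population units in this cell:
            # lambda * (1 - pi), with lambda estimated by mu / pi.
            rest = mu[i][j] * (1 - pi) / pi
            tau1 += math.exp(-rest)
            tau2 += (1 - math.exp(-rest)) / rest if rest > 0 else 1.0
    return tau1, tau2


# Illustrative toy table of sample counts (made-up numbers).
toy_table = [[1, 0, 3],
             [0, 1, 2],
             [4, 2, 1]]
tau1, tau2 = risk_estimates(toy_table, pi=0.1)
print(tau1, tau2)
```

A richer log-linear model (for example, one with interaction terms across six key variables) would replace `independence_fit`. The abstract's point is that the resulting risk estimates depend on that choice, tending to decrease as the model becomes more complex.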