A Bayesian Hierarchical Model Approach to Risk Estimation in Statistical Disclosure Limitation

When microdata files are released for research, external users may attempt to breach confidentiality. For this reason, most National Statistical Institutes apply some form of disclosure risk assessment and data protection. Risk assessment first requires defining a measure of disclosure risk. In this paper we build on previous work by [BF98] to define a Bayesian hierarchical model for risk estimation. We follow a superpopulation approach similar to [BKP90] and [Rin03]. For each combination of values of the key variables, we derive the posterior distribution of the population frequency given the observed sample frequency. Knowledge of this posterior distribution enables us to obtain suitable summaries that can be used to estimate the risk of disclosure. One such summary is the mean of the reciprocal of the population frequency (the Benedetti-Franconi risk), but we also investigate others, such as the mode. We apply our approach to an artificial sample from the 1991 Italian Census data, drawn by means of a widely used sampling scheme. We report the results of this application and document the computational difficulties that we encountered. The risk estimates that we obtain are sensible but suggest possible improvements and modifications to our methodology. We discuss these together with potential alternative strategies.
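
To make the two summaries mentioned above concrete, the following is a minimal sketch rather than the model developed in this paper. It assumes, purely for illustration, that the posterior of the population frequency F given a sample frequency f is negative binomial with f successes and success probability p, so that Pr(F = h | f) = C(h - 1, f - 1) p^f (1 - p)^(h - f) for h >= f. The function names, the parameter p, and the truncation tolerance are hypothetical choices made for this example.

```python
import math


def posterior_pmf(f, p, tol=1e-10, max_terms=1_000_000):
    """Return [(h, Pr(F = h | f))] for h = f, f + 1, ...

    ASSUMPTION: the posterior is negative binomial with f successes and
    success probability p. The support is truncated once the probability
    mass covered exceeds 1 - tol (or after max_terms terms).
    """
    if f < 1 or not 0.0 < p < 1.0:
        raise ValueError("need f >= 1 and 0 < p < 1")
    pmf, covered, h = [], 0.0, f
    while covered < 1.0 - tol and h < f + max_terms:
        # log of C(h - 1, f - 1) * p**f * (1 - p)**(h - f), in log space
        log_prob = (
            math.lgamma(h) - math.lgamma(f) - math.lgamma(h - f + 1)
            + f * math.log(p) + (h - f) * math.log1p(-p)
        )
        prob = math.exp(log_prob)
        pmf.append((h, prob))
        covered += prob
        h += 1
    return pmf


def mean_reciprocal_risk(f, p):
    """Posterior mean of 1/F: the sum over h of Pr(F = h | f) / h."""
    return sum(prob / h for h, prob in posterior_pmf(f, p))


def posterior_mode(f, p):
    """Most probable population frequency under the assumed posterior."""
    return max(posterior_pmf(f, p), key=lambda item: item[1])[0]


if __name__ == "__main__":
    # Hypothetical inputs: a sample-unique cell (f = 1) with p = 0.1.
    print(mean_reciprocal_risk(1, 0.1))
    print(posterior_mode(3, 0.3))
```

Under this assumed posterior, the mean reciprocal for a sample-unique cell (f = 1) reduces to the closed form p / (1 - p) * log(1 / p), which gives a quick check on the truncated sum. A different posterior, such as one arising from a hierarchical model, would replace only `posterior_pmf`; the two summaries would be computed in the same way.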