Data Masking for Disclosure Limitation

Government agencies that conduct surveys and censuses collect data from respondents in order to release them as statistical summaries. The more detailed a summary is, the more likely it is that a data intruder can extract confidential information about individual respondents from it. However, there are various ways of redesigning the data product, modifying the data themselves, or both, so that confidentiality is protected while usefulness is preserved. We discuss methods that pursue two goals: (i) a data intruder cannot extract, with high confidence, confidential data directly from a data product or derive confidential microdata by combining several data products; and (ii) the released data remain detailed enough to be useful to most data users, including researchers. Such “data-masking” methods make up a fast-growing field often called statistical disclosure control. We discuss some simpler methods that have been used for decades, such as detail reduction, cell suppression, and data swapping; some methods developed in the 1990s, such as rank swapping, data shuffling, and multiplicative noise; and some methods developed in the most recent decade, such as randomization of microdata with constraints (the post-randomization method, PRAM) and synthetic data.

Keywords: disclosure limitation; statistical disclosure control; data swapping; rank swapping; multiplicative noise; synthetic data; cell suppression
