Masking and Re-identification Methods for Public-Use Microdata: Overview and Research Problems

This paper provides an overview of methods for masking microdata so that the data can be released in public-use files. It divides the methods according to whether or not they have been demonstrated to provide analytic properties. For those methods that have been shown to provide one or two sets of analytic properties in the masked data, we indicate where the data may have limitations for most analyses and how re-identification can, or potentially could, be performed. We cover several methods for producing synthetic data and possible computational extensions for better automating the creation of the underlying statistical models. We conclude by providing background on analysis-specific and general information-loss metrics to stimulate research.
