Hybrid microdata using microaggregation

Statistical disclosure control (also known as privacy-preserving data mining) of microdata aims to release data sets containing the answers of individual respondents, protected in such a way that: (i) the respondents corresponding to the released records cannot be re-identified; (ii) the released data remain analytically useful. Usually, the protected data set is generated either by masking (i.e., perturbing) the original data or by generating synthetic (i.e., simulated) data that preserve some pre-selected statistics of the original data. Masked data may approximately preserve a broad range of distributional characteristics, although very few of them (if any) are preserved exactly. Synthetic data, on the other hand, exactly preserve the pre-selected statistics and may seem less disclosive than masked data, but they do not preserve any statistics other than those pre-selected. Hybrid data, obtained by mixing the original data and synthetic data, have been proposed in the literature to combine the strengths of masked and synthetic data. We show how to easily obtain hybrid data by combining microaggregation with any synthetic data generator. We show that numerical hybrid data that exactly preserve the means and covariances of the original data, and approximately preserve other statistics as well as some subdomain analyses, can be obtained as a particular case with a very simple parameterization. The new method is competitive with both the literature on hybrid data and plain multivariate microaggregation.
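To make the idea of combining microaggregation with a synthetic data generator concrete, the following is a minimal sketch and not the exact procedure of the paper. It assumes a numerical data matrix, a minimum group size `k` (a hypothetical parameter name), a simple stand-in partition that sorts records along the first principal component (a real microaggregation heuristic would be used in practice), and a generator that draws Gaussian noise and rescales it so that each group's mean and covariance matrix are matched exactly. Because group sizes, group means and within-group covariances are all preserved, the means and covariances of the whole data set are preserved as well. All function names are illustrative.

```python
import numpy as np


def partition_by_first_pc(X, k):
    """Stand-in for microaggregation: groups of at least k records each.

    Records are sorted by their projection on the first principal
    component and cut into consecutive groups (an assumed heuristic).
    """
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    order = np.argsort(Xc @ vt[0])
    n_groups = max(len(X) // k, 1)
    # With n_groups = floor(n / k), every split has at least k records.
    return np.array_split(order, n_groups)


def synthetic_like(G, rng):
    """Synthetic records with exactly the mean and covariance of group G."""
    n, d = G.shape
    mu = G.mean(axis=0)
    S = (G - mu).T @ (G - mu) / n          # population covariance of the group

    # Random draws, centred and whitened so their covariance is exactly I.
    Z = rng.standard_normal((n, d))
    Zc = Z - Z.mean(axis=0)
    Lz = np.linalg.cholesky(Zc.T @ Zc / n)
    W = np.linalg.solve(Lz, Zc.T).T

    # Factor S = A A^T via an eigendecomposition (tolerates singular S).
    vals, vecs = np.linalg.eigh(S)
    A = vecs * np.sqrt(np.clip(vals, 0.0, None))
    return mu + W @ A.T


def hybrid_microdata(X, k, seed=0):
    """Replace each group of records by synthetic records matching it."""
    X = np.asarray(X, dtype=float)
    if k <= X.shape[1]:
        raise ValueError("this sketch needs k > number of attributes")
    rng = np.random.default_rng(seed)
    H = np.empty_like(X)
    for idx in partition_by_first_pc(X, k):
        H[idx] = synthetic_like(X[idx], rng)
    return H
```

In this sketch, a small `k` keeps the synthetic records close to the original ones, whereas taking `k` equal to the number of records yields a single group, that is, fully synthetic data that match only the global means and covariances. The requirement that `k` exceed the number of attributes is a limitation of this particular generator, not a claim about the method in the paper.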
