COCOA: A Synthetic Data Generator for Testing Anonymization Techniques

Extensive testing of anonymization techniques is critical to assess their robustness and to identify the scenarios where they are most suitable. However, access to real microdata is highly restricted, and the microdata that are publicly available are usually anonymized or aggregated, which reduces their value for testing purposes. In this paper, we present COCOA, a framework for generating realistic synthetic microdata that lets users define multi-attribute relationships in order to preserve the functional dependencies of the data. We show how COCOA strengthens the testing of anonymization techniques by broadening the number and diversity of the test scenarios. Results also show that COCOA is practical for generating large datasets.
