Does $k$-Anonymous Microaggregation Affect Machine-Learned Macrotrends?

In the era of big data, the availability of massive amounts of information makes privacy protection more necessary than ever. Among a variety of anonymization mechanisms, microaggregation is a common approach to satisfying the popular requirement of $k$-anonymity in statistical databases. In essence, $k$-anonymous microaggregation aggregates quasi-identifiers to hide the identity of each data subject within a group of $k-1$ other subjects. As with any perturbative mechanism, however, anonymization comes at the cost of some information loss that may hinder the ultimate purpose of the released data, which is very often to build machine-learning models for macrotrend analysis. To assess the impact of microaggregation on the utility of the anonymized data, it is necessary to evaluate the resulting accuracy of such models. In this paper, we address the problem of measuring the effect of $k$-anonymous microaggregation on the empirical utility of microdata. Accordingly, we quantify utility as the accuracy of classification models learned from microaggregated data and evaluated on original test data. Our experiments indicate, with some consistency, that the impact of the de facto microaggregation standard (maximum distance to average vector) on the performance of machine-learning algorithms is often minor to negligible over a wide range of $k$, for a variety of classification algorithms and data sets. Furthermore, experimental evidence suggests that the traditional measure of distortion in the microdata anonymization community may be inappropriate for evaluating the utility of microaggregated data.
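The evaluation idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the commonly described form of maximum distance to average vector (MDAV) with fixed group size $k$, takes distortion to be the within-group sum of squared errors normalized by the total sum of squares (one common convention, assumed here), and uses a 1-nearest-neighbor classifier on synthetic two-class data as a stand-in for the classification algorithms and data sets of the paper. All function names are hypothetical.

```python
import random


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))


def sqdist(a, b):
    """Squared Euclidean distance between two tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mdav(records, k):
    """Partition record indices into groups of size between k and 2k-1.

    Simplified MDAV (assumed variant): repeatedly group the k records
    nearest to the record farthest from the centroid, and then the k
    records nearest to the record farthest from that one.
    """
    if k < 1 or len(records) < k:
        raise ValueError("need k >= 1 and at least k records")
    remaining = list(range(len(records)))
    groups = []

    def take_group(seed):
        # The k remaining records closest to the seed record.
        nearest = sorted(remaining,
                         key=lambda i: sqdist(records[i], records[seed]))[:k]
        for i in nearest:
            remaining.remove(i)
        groups.append(nearest)

    while len(remaining) >= 3 * k:
        c = centroid([records[i] for i in remaining])
        r = max(remaining, key=lambda i: sqdist(records[i], c))
        s = max(remaining, key=lambda i: sqdist(records[i], records[r]))
        take_group(r)
        take_group(s)
    if len(remaining) >= 2 * k:
        c = centroid([records[i] for i in remaining])
        r = max(remaining, key=lambda i: sqdist(records[i], c))
        take_group(r)
    groups.append(list(remaining))  # between k and 2k-1 records are left
    return groups


def microaggregate(records, k):
    """Replace each record by the centroid of its MDAV group."""
    out = [None] * len(records)
    for group in mdav(records, k):
        c = centroid([records[i] for i in group])
        for i in group:
            out[i] = c
    return out


def distortion(records, anonymized):
    """Normalized SSE: within-group squared error over total squared error."""
    c = centroid(records)
    sse = sum(sqdist(x, y) for x, y in zip(records, anonymized))
    sst = sum(sqdist(x, c) for x in records)
    return sse / sst


def accuracy_1nn(train_x, train_y, test_x, test_y):
    """Accuracy of a 1-nearest-neighbor classifier (illustrative stand-in)."""
    hits = 0
    for x, y in zip(test_x, test_y):
        j = min(range(len(train_x)), key=lambda i: sqdist(train_x[i], x))
        hits += train_y[j] == y
    return hits / len(test_x)


def demo(k=5, seed=0):
    """Train on microaggregated data, test on original data (synthetic)."""
    rng = random.Random(seed)

    def sample(n):
        labels = [rng.randint(0, 1) for _ in range(n)]
        points = [(rng.gauss(2.0 * y, 1.0), rng.gauss(2.0 * y, 1.0))
                  for y in labels]
        return points, labels

    train_x, train_y = sample(200)
    test_x, test_y = sample(100)
    anon_x = microaggregate(train_x, k)  # labels are left unperturbed
    print("distortion:", distortion(train_x, anon_x))
    print("accuracy, original training data:",
          accuracy_1nn(train_x, train_y, test_x, test_y))
    print("accuracy, microaggregated training data:",
          accuracy_1nn(anon_x, train_y, test_x, test_y))


if __name__ == "__main__":
    demo()
```

Running `demo()` for several values of `k` gives a rough feel for how distortion and test accuracy can evolve differently as $k$ grows. The synthetic data and the 1-nearest-neighbor classifier are illustrative choices only, so the numbers say nothing about the paper's experimental results.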
