Mathematically optimized, recursive prepartitioning strategies for k-anonymous microaggregation of large-scale datasets

Abstract This work falls within the field of statistical disclosure control (SDC), which concerns the postprocessing of the demographic portion of the statistical results of surveys containing sensitive personal information, in order to effectively safeguard the anonymity of the participating respondents. A widely known technique for protecting the privacy of respondents beyond the mere suppression of their identifiers is k-anonymous microaggregation. Unfortunately, most microaggregation algorithms that achieve competitively low distortion exhibit superlinear running time, typically scaling with the square of the number of records in the dataset. This work proposes and analyzes an optimized prepartitioning strategy that significantly reduces the running time of k-anonymous microaggregation on large datasets, with a mild loss in data utility with respect to that of MDAV, the underlying method. The strategy prepartitions a dataset recursively until the desired k-anonymity parameter is achieved. Traditional microaggregation algorithms have quadratic computational complexity, Θ(n^2). By using the proposed method and fixing the number of recursive prepartitions, we obtain subquadratic complexity of the form Θ(n^(3/2)), Θ(n^(4/3)), ..., depending on the number of prepartitions. Alternatively, fixing the ratio between the sizes of the microcell and the macrocell at each prepartition yields quasilinear complexity, Θ(n log n). Our method is readily applicable to large-scale datasets with numerical demographic attributes.
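
The complexity figures quoted above can be illustrated with a small cost model. The sketch below is not the authors' analysis. It assumes that running the underlying algorithm on n records with target cell size k costs about n^2/k operations (a common approximation for MDAV-style methods, consistent with the quadratic behavior stated in the abstract for fixed k), and it assumes equal size ratios between consecutive partition levels. The function names mdav_cost, prepartitioned_cost, and growth_exponent are hypothetical.

```python
import math


def mdav_cost(n, k):
    """Assumed cost model (not from the source): about n^2 / k operations."""
    return n * n / k


def prepartitioned_cost(n, k, levels):
    """Modeled cost with `levels` recursive prepartitions.

    Cell sizes shrink from n down to k in (levels + 1) stages, each stage
    dividing the cell size by the same ratio r (an illustrative assumption).
    """
    r = (n / k) ** (1.0 / (levels + 1))
    total = 0.0
    size = float(n)
    for _ in range(levels + 1):
        target = size / r
        # There are n / size cells of `size` records; each one is
        # microaggregated into cells of `target` records.
        total += (n / size) * mdav_cost(size, target)
        size = target
    return total


def growth_exponent(levels, n=10**6, k=10):
    """Exponent p such that the modeled cost grows like n^p when n doubles."""
    ratio = prepartitioned_cost(2 * n, k, levels) / prepartitioned_cost(n, k, levels)
    return math.log(ratio, 2)


if __name__ == "__main__":
    for levels in range(4):
        print(levels, round(growth_exponent(levels), 4))
```

Under these assumptions every stage costs n * r, so the total is (levels + 1) * n * (n/k)^(1/(levels + 1)). This reproduces the exponents 2, 3/2, 4/3, ... for 0, 1, 2, ... prepartitions. If instead the ratio r is held fixed, the number of stages grows like log(n/k) / log(r), and the same model gives a total of order n log n.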
