p-Probabilistic k-anonymous microaggregation for the anonymization of surveys with uncertain participation

We develop a probabilistic variant of k-anonymous microaggregation which we term p-probabilistic resorting to a statistical model of respondent participation in order to aggregate quasi-identifiers in such a manner that k-anonymity is concordantly enforced with a parametric probabilistic guarantee. Succinctly owing the possibility that some respondents may not finally participate, sufficiently larger cells are created striving to satisfy k-anonymity with probability at least p. The microaggregation function is designed before the respondents submit their confidential data. More precisely, a specification of the function is sent to them which they may verify and apply to their quasi-identifying demographic variables prior to submitting the microaggregated data along with the confidential attributes to an authorized repository. We propose a number of metrics to assess the performance of our probabilistic approach in terms of anonymity and distortion which we proceed to investigate theoretically in depth and empirically with synthetic and standardized data. We stress that in addition to constituting a functional extension of traditional microaggregation, thereby broadening its applicability to the anonymization of statistical databases in a wide variety of contexts, the relaxation of trust assumptions is arguably expected to have a considerable impact on user acceptance and ultimately on data utility through mere availability.

[1]  Jordi Nin,et al.  Efficient microaggregation techniques for large numerical data volumes , 2012, International Journal of Information Security.

[2]  David J. DeWitt,et al.  Incognito: efficient full-domain K-anonymity , 2005, SIGMOD '05.

[3]  Michael J. Laszlo,et al.  Minimum spanning tree partitioning algorithm for microaggregation , 2005, IEEE Transactions on Knowledge and Data Engineering.

[4]  Josep Domingo-Ferrer,et al.  On the complexity of optimal microaggregation for statistical disclosure control , 2001 .

[5]  Ninghui Li,et al.  t-Closeness: Privacy Beyond k-Anonymity and l-Diversity , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[6]  Joel Max,et al.  Quantizing for minimum distortion , 1960, IRE Trans. Inf. Theory.

[7]  Jordi Forné,et al.  Measuring the privacy of user profiles in personalized information systems , 2014, Future Gener. Comput. Syst..

[8]  Josep Domingo-Ferrer,et al.  Efficient multivariate data-oriented microaggregation , 2006, The VLDB Journal.

[9]  Josep Domingo-Ferrer,et al.  Probabilistic k-anonymity through microaggregation and data swapping , 2012, 2012 IEEE International Conference on Fuzzy Systems.

[10]  Josep Domingo-Ferrer,et al.  Hybrid microdata using microaggregation , 2010, Inf. Sci..

[11]  Josep Domingo-Ferrer,et al.  From t-Closeness to PRAM and Noise Addition Via Information Theory , 2008, Privacy in Statistical Databases.

[12]  Sheng Zhong,et al.  k-Anonymous data collection , 2009, Inf. Sci..

[13]  Josep Domingo-Ferrer,et al.  t-Closeness through Microaggregation: Strict Privacy with Enhanced Utility Preservation , 2015, IEEE Transactions on Knowledge and Data Engineering.

[14]  Ashwin Machanavajjhala,et al.  l-Diversity: Privacy Beyond k-Anonymity , 2006, ICDE.

[15]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[16]  Alexandre V. Evfimievski,et al.  Limiting privacy breaches in privacy preserving data mining , 2003, PODS.

[17]  Josep Domingo-Ferrer,et al.  H(k)-private Information Retrieval from Privacy-uncooperative Queryable Databases.">h(k)-private Information Retrieval from Privacy-uncooperative Queryable Databases , 2009, Online Inf. Rev..

[18]  Traian Marius Truta,et al.  Protection : p-Sensitive k-Anonymity Property , 2006 .

[19]  Claude E. Shannon,et al.  Communication theory of secrecy systems , 1949, Bell Syst. Tech. J..

[20]  Jordi Forné,et al.  Optimizing the design parameters of threshold pool mixes for anonymity and delay , 2014, Comput. Networks.

[21]  Vitaly Shmatikov,et al.  The cost of privacy: destruction of data-mining utility in anonymized data publishing , 2008, KDD.

[22]  Josep Domingo-Ferrer,et al.  Ordinal, Continuous and Heterogeneous k-Anonymity Through Microaggregation , 2005, Data Mining and Knowledge Discovery.

[23]  Pierangela Samarati,et al.  Protecting Respondents' Identities in Microdata Release , 2001, IEEE Trans. Knowl. Data Eng..

[24]  Josep Domingo-Ferrer,et al.  From t-Closeness-Like Privacy to Postrandomization via Information Theory , 2010, IEEE Transactions on Knowledge and Data Engineering.

[25]  Jordi Forné,et al.  Private Location-Based Information Retrieval via k-Anonymous Clustering , 2010 .

[26]  Jordi Forné,et al.  An algorithm for k-anonymous microaggregation and clustering inspired by the design of distortion-optimized quantizers , 2011, Data Knowl. Eng..

[27]  Lior Rokach,et al.  Privacy-preserving data mining: A feature set partitioning approach , 2010, Inf. Sci..

[28]  Javier Herranz,et al.  On the disclosure risk of multivariate microaggregation , 2008, Data Knowl. Eng..

[29]  Jordi Forné,et al.  Optimized Query Forgery for Private Information Retrieval , 2010, IEEE Transactions on Information Theory.

[30]  Jordi Forné,et al.  A modification of the Lloyd algorithm for k-anonymous quantization , 2013, Inf. Sci..

[31]  Jordi Forné,et al.  On the measurement of privacy as an attacker’s estimation error , 2012, International Journal of Information Security.

[32]  Jordi Forné,et al.  On collaborative anonymous communications in lossy networks , 2014, Secur. Commun. Networks.

[33]  Peter Norvig,et al.  The Unreasonable Effectiveness of Data , 2009, IEEE Intelligent Systems.

[34]  Jorge J. Moré,et al.  The Levenberg-Marquardt algo-rithm: Implementation and theory , 1977 .

[35]  Josep Domingo-Ferrer,et al.  A Critique of k-Anonymity and Some of Its Enhancements , 2008, 2008 Third International Conference on Availability, Reliability and Security.

[36]  Chin-Chen Chang,et al.  TFRP: An efficient microaggregation algorithm for statistical disclosure control , 2007, J. Syst. Softw..

[37]  Pei-Chann Chang,et al.  Density-based microaggregation for statistical disclosure control , 2010, Expert Syst. Appl..

[38]  Kian-Lee Tan,et al.  CASTLE: Continuously Anonymizing Data Streams , 2011, IEEE Transactions on Dependable and Secure Computing.

[39]  Yu Hui-qun,et al.  An Improved V-MDAV Algorithm for l-Diversity , 2008, 2008 International Symposiums on Information Processing.

[40]  Josep Domingo-Ferrer,et al.  A polynomial-time approximation to optimal multivariate microaggregation , 2008, Comput. Math. Appl..

[41]  A. Solanas,et al.  V-MDAV : A Multivariate Microaggregation With Variable Group Size , 2006 .

[42]  Josep Domingo-Ferrer,et al.  Practical Data-Oriented Microaggregation for Statistical Disclosure Control , 2002, IEEE Trans. Knowl. Data Eng..

[43]  Hua Wang,et al.  Enhanced P-Sensitive K-Anonymity Models for Privacy Preserving Data Publishing , 2008, Trans. Data Priv..