Fast data-oriented microaggregation algorithm for large numerical datasets

Microaggregation is a well-established mechanism for balancing respondent privacy against data quality in Statistical Disclosure Control. For numerical datasets, microaggregation is defined as a clustering problem in which every group must contain at least k records and the within-group sum of squared errors (SSE) is minimized. In practice, the data publisher has to run an algorithm repeatedly for different values of k to find a good trade-off between privacy and utility. Running an algorithm many times on a large numerical dataset wastes resources, since most of the computations are repeated from one run to the next. In this paper, we propose a Fast Data-oriented Microaggregation algorithm (FDM) that efficiently anonymizes large multivariate numerical datasets for multiple successive values of k. Experimental results on real-world datasets demonstrate the superiority of the method in terms of both data quality and time complexity. Moreover, the method usually achieves a better trade-off between disclosure risk and information loss in the protected dataset than previous techniques.
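To make the objective concrete, the following minimal Python sketch evaluates the SSE of a given partition and checks the minimum group size k. It is not the FDM algorithm; the function names `sse` and `microaggregate`, the tuple-based record layout, and the toy data are assumptions made purely for illustration.

```python
from collections import defaultdict


def _centroid(members):
    """Component-wise mean of a non-empty list of equal-length records."""
    dim = len(members[0])
    return [sum(r[j] for r in members) / len(members) for j in range(dim)]


def _group(records, labels):
    """Collect records by their group label."""
    groups = defaultdict(list)
    for rec, lab in zip(records, labels):
        groups[lab].append(rec)
    return groups


def sse(records, labels, k):
    """Within-group sum of squared errors of a partition.

    Raises ValueError if any group has fewer than k records,
    i.e. if the partition violates the microaggregation constraint.
    """
    total = 0.0
    for lab, members in _group(records, labels).items():
        if len(members) < k:
            raise ValueError(f"group {lab!r} has fewer than {k} records")
        c = _centroid(members)
        total += sum((r[j] - c[j]) ** 2 for r in members for j in range(len(c)))
    return total


def microaggregate(records, labels):
    """Replace every record with the centroid of its group."""
    centroids = {lab: _centroid(m) for lab, m in _group(records, labels).items()}
    return [tuple(centroids[lab]) for lab in labels]


# Hypothetical toy data: four 2-dimensional records split into two groups.
records = [(1.0, 2.0), (3.0, 4.0), (10.0, 10.0), (12.0, 14.0)]
labels = [0, 0, 1, 1]

print(sse(records, labels, k=2))        # 14.0
print(microaggregate(records, labels))
```

Here the first group has centroid (2, 3) and contributes 4 to the SSE, while the second has centroid (11, 12) and contributes 10. Calling `sse` with k=3 on the same partition raises `ValueError`, since both groups hold only two records. A publisher comparing several values of k would evaluate this quantity once per candidate partition.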
