Efficient microaggregation techniques for large numerical data volumes

The conflicting requirements of data privacy and data analysis have fostered the development of statistical disclosure control techniques. In this context, microaggregation is one of the most frequently used methods, since it offers a good trade-off between simplicity and quality. Unfortunately, most currently available microaggregation algorithms were devised for small datasets, while the size of databases keeps increasing. The usual way to tackle this problem is to partition large data volumes into smaller fragments that can be processed in reasonable time by the available algorithms, but this solution comes at the cost of quality. In this paper, we revisit the computational needs of microaggregation and show that they can be reduced to two steps: sorting the dataset with respect to a vantage point, and a set of k-nearest neighbors searches. From this point of view, we propose three new efficient, quality-preserving microaggregation algorithms based on k-nearest neighbors search techniques. We compare our approaches with the most significant strategies in the literature using three very large real datasets. Experimental results show that our proposals outperform previous techniques by striking a better balance between performance and the quality of the anonymized dataset.
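
The two-step view described above (a sort with respect to a vantage point, followed by k-nearest neighbors searches) can be illustrated with a minimal sketch. This is not one of the three algorithms proposed in the paper; it is a generic fixed-size heuristic written under stated assumptions: the vantage point is taken to be the dataset centroid, records are visited from farthest to nearest, and the k-nearest neighbors search is a brute-force scan where an efficient implementation would presumably use an index such as a k-d tree. The function name `microaggregate` and its return values are hypothetical.

```python
import math


def microaggregate(records, k):
    """Illustrative fixed-size microaggregation (not the paper's method).

    Returns (groups, anonymized): groups is a list of lists of record
    indices, each of size between k and 2k-1; anonymized replaces every
    record with the centroid of its group.
    """
    n = len(records)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k records")
    dim = len(records[0])

    def centroid(indices):
        return tuple(
            sum(records[i][d] for i in indices) / len(indices)
            for d in range(dim)
        )

    # Step 1: sort indices by distance to a vantage point.
    # Assumption: the vantage point is the centroid of the whole dataset,
    # and records are visited from farthest to nearest.
    vantage = centroid(range(n))
    order = sorted(
        range(n),
        key=lambda i: math.dist(records[i], vantage),
        reverse=True,
    )

    unassigned = set(range(n))
    groups = []
    for seed in order:
        if seed not in unassigned:
            continue
        if len(unassigned) < 2 * k:
            # Too few records left for two groups: close with one group.
            groups.append(sorted(unassigned))
            unassigned.clear()
            break
        # Step 2: k-nearest neighbors search among unassigned records.
        # Brute force here; an index structure would replace this scan.
        others = sorted(
            unassigned - {seed},
            key=lambda j: math.dist(records[j], records[seed]),
        )[: k - 1]
        group = [seed] + others
        groups.append(group)
        unassigned.difference_update(group)

    # Replace each record with the centroid of its group.
    anonymized = [None] * n
    for group in groups:
        c = centroid(group)
        for i in group:
            anonymized[i] = c
    return groups, anonymized


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    groups, anonymized = microaggregate(data, k=3)
    print(groups)
    print(anonymized)
```

In this sketch every group has at least k members, so each anonymized record is indistinguishable from at least k-1 others. The brute-force neighbor scan makes the overall cost quadratic in the number of records, which is the kind of cost that motivates faster k-nearest neighbors search on large datasets.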
