On the Comparison of Generic Information Loss Measures and Cluster-Specific Ones

Masking methods protect databases prior to their public release. They mask an original data file so that the new file preserves the privacy of the data respondents. Information loss measures have been developed to evaluate to what extent the masked file diverges from the corresponding original file, and to what extent the same analyses on both files lead to the same results. Generic information loss measures ignore the intended use of the file. They are the standard measures when data have to be released (e.g. published on the web) and there is no control over what kind of analyses users will perform. In this paper we study generic information loss measures and compare them with cluster-specific ones, that is, measures specifically defined for the case in which the user intends to cluster the data. To do so, we define such measures and then carry out an extensive comparison of the two kinds of measures. The paper shows that the generic measures can cope with the information loss related to clustering.
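
The contrast between the two kinds of measures can be illustrated with a minimal sketch. It is not the paper's definition of either measure: the generic measure is assumed here to be the mean absolute difference between original and masked values, and the cluster-specific measure is assumed to be one minus the Rand index between the partitions that a small k-means routine produces on each file. All function names, the toy data, and the choice of k-means are hypothetical.

```python
from itertools import combinations


def mean_abs_diff(original, masked):
    """Assumed generic measure: mean absolute difference over all cells."""
    total = sum(abs(a - b)
                for row_o, row_m in zip(original, masked)
                for a, b in zip(row_o, row_m))
    return total / (len(original) * len(original[0]))


def kmeans(points, k, iters=20):
    """Tiny Lloyd-style k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid (squared Euclidean).
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


def rand_index(a, b):
    """Fraction of record pairs on which two partitions agree."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def cluster_loss(original, masked, k):
    """Assumed cluster-specific measure: 1 - Rand index of the partitions."""
    return 1.0 - rand_index(kmeans(original, k), kmeans(masked, k))


# Toy data (hypothetical): two well-separated groups and a lightly
# perturbed copy standing in for the masked file.
original = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
masked = [[0.3, -0.2], [9.6, 10.4], [0.1, 1.5],
          [10.2, 10.7], [0.8, 0.3], [11.4, 9.9]]

generic = mean_abs_diff(original, masked)
specific = cluster_loss(original, masked, k=2)
print(f"generic loss: {generic:.3f}, cluster-specific loss: {specific:.3f}")
```

In this toy setting the perturbation changes the values but not the cluster structure, so the two measures can disagree about how much information was lost; comparing them over many masked files is the kind of study the abstract describes.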
