A Semantic Information Loss Metric for Privacy Preserving Publication

Data distortion is inevitable in privacy-preserving data publication, and many quality metrics have been proposed to measure the quality of anonymized data, among which information loss metrics are widely used. Most existing information loss metrics, however, are non-semantic and are therefore limited in how well they reflect data distortion. As a result, the utility of anonymized data produced under these metrics is constrained. In this paper, we propose a novel semantic information loss metric, SILM, which takes into account the correlation among attributes. This new metric captures distortion more precisely than state-of-the-art information loss metrics, especially when strong correlations exist among attributes. We evaluate the effect of SILM on data quality in terms of the accuracy of aggregate query answering and classification. Comprehensive experiments demonstrate that SILM helps improve the quality of anonymized data, particularly when it is integrated with suitable anonymization algorithms.
