Information based data anonymization for classification utility

Anonymization is a practical approach to protect privacy in data. The major objective of privacy preserving data publishing is to protect private information in data whereas data is still useful for some intended applications, such as building classification models. In this paper, we argue that data generalization in anonymization should be determined by the classification capability of data rather than the privacy requirement. We make use of mutual information for measuring classification capability for generalization, and propose two k-anonymity algorithms to produce anonymized tables for building accurate classification models. The algorithms generalize attributes to maximize the classification capability, and then suppress values by a privacy requirement k (IACk) or distributional constraints (IACc). Experimental results show that algorithm IACk supports more accurate classification models and is faster than a benchmark utility-aware data anonymization algorithm.

[1]  Ashwin Machanavajjhala,et al.  l-Diversity: Privacy Beyond k-Anonymity , 2006, ICDE.

[2]  Philip S. Yu,et al.  Top-down specialization for information and privacy preservation , 2005, 21st International Conference on Data Engineering (ICDE'05).

[3]  S. Kullback,et al.  Information Theory and Statistics , 1959 .

[4]  Slava Kisilevich,et al.  Efficient Multidimensional Suppression for K-Anonymity , 2010, IEEE Transactions on Knowledge and Data Engineering.

[5]  David J. DeWitt,et al.  Incognito: efficient full-domain K-anonymity , 2005, SIGMOD '05.

[6]  Aiko M. Hormann,et al.  Programs for Machine Learning. Part I , 1962, Inf. Control..

[7]  Wendy Hui Wang,et al.  Privacy-preserving publishing microdata with full functional dependencies , 2011, Data Knowl. Eng..

[8]  Martin S. Olivier,et al.  On the use of economic price theory to find the optimum levels of privacy and information utility in non-perturbative microdata anonymisation , 2010, Data Knowl. Eng..

[9]  David J. DeWitt,et al.  Mondrian Multidimensional K-Anonymity , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[10]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[11]  Philip S. Yu,et al.  Bottom-up generalization: a data mining solution to privacy protection , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[12]  Ninghui Li,et al.  On the tradeoff between privacy and utility in data publishing , 2009, KDD.

[13]  Philip S. Yu,et al.  Anonymizing Classification Data for Privacy Preservation , 2007, IEEE Transactions on Knowledge and Data Engineering.

[14]  Jian Pei,et al.  Utility-based anonymization using local recoding , 2006, KDD '06.

[15]  Wenfei Fan,et al.  Conditional functional dependencies for capturing data inconsistencies , 2008, TODS.

[16]  Ninghui Li,et al.  t-Closeness: Privacy Beyond k-Anonymity and l-Diversity , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[17]  F. Yong,et al.  Alpha + + , 1999 .

[18]  Roberto J. Bayardo,et al.  Data privacy through optimal k-anonymization , 2005, 21st International Conference on Data Engineering (ICDE'05).

[19]  David J. DeWitt,et al.  Workload-aware anonymization techniques for large-scale datasets , 2008, TODS.

[20]  Vijay S. Iyengar,et al.  Transforming data to satisfy privacy constraints , 2002, KDD.

[21]  Raymond Chi-Wing Wong,et al.  (α, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing , 2006, KDD '06.

[22]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[23]  Ran Wolff,et al.  The VLDB Journal manuscript No. (will be inserted by the editor) Providing k-Anonymity in Data Mining , 2022 .

[24]  Latanya Sweeney,et al.  k-Anonymity: A Model for Protecting Privacy , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[25]  Bradley Malin,et al.  Inferring Genotype from Clinical Phenotype through a Knowledge Based Algorithm , 2001, Pacific Symposium on Biocomputing.

[26]  Raymond Chi-Wing Wong,et al.  Anonymization by Local Recoding in Data with Attribute Hierarchical Taxonomies , 2008, IEEE Transactions on Knowledge and Data Engineering.

[27]  Shouhuai Xu,et al.  Privacy-Preserving Data Mining through Knowledge Model Sharing , 2007, PinKDD.