Improving accuracy of classification models induced from anonymized datasets

The performance of classifiers and other data mining models can be significantly enhanced by using the large repositories of digital data now collected by public and private organizations. However, the original records stored in those repositories cannot be released to data miners, because they frequently contain sensitive information. The emerging field of Privacy Preserving Data Publishing (PPDP) addresses this challenge. In this paper, we present NSVDist (Non-homogeneous generalization with Sensitive Value Distributions), a new anonymization algorithm. Given minimal anonymity and diversity parameters along with an information loss measure, NSVDist produces non-homogeneous anonymizations in which the sensitive attribute is published as a frequency distribution over the sensitive domain rather than as an exact sensitive value. In experiments with eight datasets and four classification algorithms, we show that classifiers induced from data generalized by NSVDist tend to be more accurate than classifiers induced from data produced by state-of-the-art anonymization algorithms.
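To illustrate what publishing the sensitive attribute as a frequency distribution can look like, the following is a minimal sketch, not the NSVDist algorithm itself. It assumes a simplified setting in which a set of records has already been grouped together; the function names, the record layout, and the example attribute values are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def sensitive_distribution(values, domain):
    """Return the relative frequency of each domain value among `values`."""
    if not values:
        raise ValueError("cannot build a distribution from an empty group")
    counts = Counter(values)
    total = len(values)
    # Every domain value gets an entry, including values absent from the group.
    return {v: counts.get(v, 0) / total for v in domain}


def publish_group(records, sensitive_key, domain):
    """Replace each record's exact sensitive value with the group's distribution.

    Illustrative assumption: all records in `records` share one distribution.
    """
    dist = sensitive_distribution([r[sensitive_key] for r in records], domain)
    return [{**r, sensitive_key: dist} for r in records]


# Hypothetical sensitive domain and already-generalized records.
domain = ["flu", "cancer", "hiv"]
group = [
    {"age": "30-39", "zip": "123**", "disease": "flu"},
    {"age": "30-39", "zip": "123**", "disease": "flu"},
    {"age": "30-39", "zip": "123**", "disease": "cancer"},
    {"age": "30-39", "zip": "123**", "disease": "hiv"},
]

published = publish_group(group, "disease", domain)
for row in published:
    print(row)
```

In this sketch each published record carries the distribution {flu: 0.5, cancer: 0.25, hiv: 0.25} instead of its own exact value, so a classifier trained on the release would see class frequencies rather than hard labels. How NSVDist forms the record sets and enforces the anonymity and diversity parameters is not shown here.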
