Classifying data from protected statistical datasets

Statistical Disclosure Control (SDC) is an active research area in the recent years. The goal is to transform an original dataset X into a protected one X', such that X' does not reveal any relation between confidential and (quasi-)identifier attributes and such that X' can be used to compute reliable statistical information about X. Many specific protection methods have been proposed and analyzed, with respect to the levels of privacy and utility that they offer. However, when measuring utility, only differences between the statistical values of X and X' are considered. This would indicate that datasets protected by SDC methods can be used only for statistical purposes. We show in this paper that this is not the case, because a protected dataset X' can be used to construct good classifiers for future data. To do so, we describe an extensive set of experiments that we have run with different SDC protection methods and different (real) datasets. In general, the resulting classifiers are very good, which is good news for both the SDC and the Privacy-preserving Data Mining communities. In particular, our results question the necessity of some specific protection methods that have appeared in the privacy-preserving data mining (PPDM) literature with the clear goal of providing good classification.

[1]  Nong Ye,et al.  The Handbook of Data Mining , 2003 .

[2]  Keke Chen,et al.  Privacy preserving data classification with rotation perturbation , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[3]  William E. Winkler,et al.  Re-identification Methods for Masked Microdata , 2004, Privacy in Statistical Databases.

[4]  Ramakrishnan Srikant,et al.  Privacy-preserving data mining , 2000, SIGMOD '00.

[5]  P. Doyle,et al.  Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies , 2001 .

[6]  Wenliang Du,et al.  Building decision tree classifier on private data , 2002 .

[7]  U. Rovira,et al.  Chapter 6 A Quantitative Comparison of Disclosure Control Methods for Microdata , 2001 .

[8]  Ran Wolff,et al.  k-Anonymous Decision Tree Induction , 2006, PKDD.

[9]  Ling Liu,et al.  A Random Rotation Perturbation Approach to Privacy Preserving Data Classification , 2005 .

[10]  Josep Domingo-Ferrer,et al.  Practical Data-Oriented Microaggregation for Statistical Disclosure Control , 2002, IEEE Trans. Knowl. Data Eng..

[11]  Tom Fawcett,et al.  ROC Graphs: Notes and Practical Considerations for Data Mining Researchers , 2003 .

[12]  Kun Liu,et al.  Random projection-based multiplicative data perturbation for privacy preserving distributed data mining , 2006, IEEE Transactions on Knowledge and Data Engineering.

[13]  David W. Aha,et al.  Instance-Based Learning Algorithms , 1991, Machine Learning.

[14]  Lior Rokach,et al.  Data Mining And Knowledge Discovery Handbook , 2005 .

[15]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[16]  Latanya Sweeney,et al.  k-Anonymity: A Model for Protecting Privacy , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[17]  Jan Paredaens,et al.  Advances in Database Systems , 1994 .

[18]  J. Ross Quinlan,et al.  Induction of Decision Trees , 1986, Machine Learning.

[19]  Josep Domingo-Ferrer,et al.  Inference Control in Statistical Databases , 2002, Lecture Notes in Computer Science.

[20]  Javier Herranz,et al.  Rethinking rank swapping to decrease disclosure risk , 2008, Data Knowl. Eng..

[21]  Ashwin Machanavajjhala,et al.  l-Diversity: Privacy Beyond k-Anonymity , 2006, ICDE.

[22]  Josep Domingo-Ferrer,et al.  Micro-aggregation-based heuristics for p-sensitive k-anonymity: one step beyond , 2008, PAIS '08.

[23]  Lior Rokach,et al.  Data Mining and Knowledge Discovery Handbook, 2nd ed , 2010, Data Mining and Knowledge Discovery Handbook, 2nd ed..

[24]  Vladimir Vapnik,et al.  The Support Vector Method , 1997, ICANN.

[25]  Josep Domingo-Ferrer,et al.  Inference Control in Statistical Databases, From Theory to Practice , 2002 .

[26]  Jay-J. Kim A METHOD FOR LIMITING DISCLOSURE IN MICRODATA BASED ON RANDOM NOISE AND , 2002 .

[27]  Jim Burridge,et al.  Information preserving statistical obfuscation , 2003, Stat. Comput..

[28]  Rathindra Sarathy,et al.  Generating Sufficiency-based Non-Synthetic Perturbed Data , 2008, Trans. Data Priv..

[29]  Josep Domingo-Ferrer,et al.  Probabilistic Information Loss Measures in Confidentiality Protection of Continuous Microdata , 2005, Data Mining and Knowledge Discovery.

[30]  Nancy L. Spruill THE CONFIDENTIALITY AND ANALYTIC USEFULNESS OF MASKED BUSINESS MICRODATA , 2002 .

[31]  S. Reiss,et al.  Data-swapping: A technique for disclosure control , 1982 .

[32]  L. Willenborg,et al.  Elements of Statistical Disclosure Control , 2000 .

[33]  V. Torra,et al.  Disclosure control methods and information loss for microdata , 2001 .

[34]  Pedro M. Domingos,et al.  On the Optimality of the Simple Bayesian Classifier under Zero-One Loss , 1997, Machine Learning.

[35]  Traian Marius Truta,et al.  Protection : p-Sensitive k-Anonymity Property , 2006 .

[36]  Pascal Heus,et al.  Data Access in a Cyber World: Making Use of Cyberinfrastructure , 2008, Trans. Data Priv..

[37]  Pierangela Samarati,et al.  Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression , 1998 .

[38]  Michael J. Laszlo,et al.  Minimum spanning tree partitioning algorithm for microaggregation , 2005, IEEE Transactions on Knowledge and Data Engineering.

[39]  Chris Clifton,et al.  Privacy-preserving Naïve Bayes classification , 2008, The VLDB Journal.

[40]  Chris Clifton,et al.  When do data mining results violate privacy? , 2004, KDD.

[41]  Josep Domingo-Ferrer,et al.  On the complexity of optimal microaggregation for statistical disclosure control , 2001 .

[42]  Josep Domingo-Ferrer,et al.  An Anonymity Model Achievable Via Microaggregation , 2008, Secure Data Management.

[43]  Ninghui Li,et al.  t-Closeness: Privacy Beyond k-Anonymity and l-Diversity , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[44]  Ruth Brand,et al.  Microdata Protection through Noise Addition , 2002, Inference Control in Statistical Databases.

[45]  William E. Winkler,et al.  Multiplicative Noise for Masking Continuous Data , 2001 .