Utility of Knowledge Discovered from Sanitized Data

While much attention has been paid to data sanitization methods that aim to protect users’ privacy, far less emphasis has been placed on the usefulness of the sanitized data from the viewpoint of knowledge discovery systems. We address this question and ask whether sanitized data can be used to obtain knowledge that is not specified at the time of sanitization. We propose a utility function for knowledge discovery algorithms that quantifies the value of the extracted knowledge from the perspective of its users. We then use this utility function to compare the usefulness of knowledge extracted from the original data with that of knowledge extracted from the sanitized data. Our experiments use an existing cooperative learning model of knowledge discovery applied to medical data, anonymized and perturbed with two widely known sanitization techniques: ε-differential privacy and k-anonymity. The results show that although sanitization can drastically reduce, and in some cases completely destroy, the utility of the data, there are cases in which utility is preserved. This supports our strategy of identifying triples, each consisting of a utility function, a sanitization mechanism, and a knowledge discovery algorithm, that are useful in practice. We categorize several instances of such triples based on the utility observed in experiments over a single database of medical records, discuss our results, and outline directions for future work.
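To make the comparison concrete, the following Python sketch (not the paper's implementation; all function names and the rule-overlap utility measure are hypothetical) illustrates the two sanitization styles and a simple utility function: Laplace-noise perturbation calibrated for ε-differential privacy, attribute generalization of the kind used by k-anonymity, and utility defined as the fraction of rules mined from the original data that survive sanitization.

import numpy as np

def laplace_sanitize(values, epsilon, sensitivity=1.0):
    # Laplace mechanism: noise with scale sensitivity/epsilon gives
    # epsilon-differential privacy for a query of that sensitivity.
    scale = sensitivity / epsilon
    return values + np.random.laplace(0.0, scale, size=len(values))

def generalize_age(age, bin_width=10):
    # Generalization of the kind used by k-anonymity: exact ages are
    # coarsened into intervals so records become indistinguishable.
    # (Checking that each interval holds at least k records is omitted.)
    lo = (age // bin_width) * bin_width
    return f"{lo}-{lo + bin_width - 1}"

def rule_overlap_utility(rules_original, rules_sanitized):
    # Hypothetical utility function: the fraction of rules mined from
    # the original data that are still recovered after sanitization
    # (1.0 = utility fully preserved, 0.0 = completely lost).
    if not rules_original:
        return 1.0
    return len(rules_original & rules_sanitized) / len(rules_original)

# Example: perturb a numeric attribute and generalize a quasi-identifier.
glucose = np.array([5.1, 6.3, 7.8, 5.9])
noisy_glucose = laplace_sanitize(glucose, epsilon=0.5)
print([generalize_age(a) for a in [23, 27, 31, 36]])  # ['20-29', '20-29', '30-39', '30-39']
# mine_rules() stands in for any knowledge discovery algorithm:
# utility = rule_overlap_utility(mine_rules(glucose), mine_rules(noisy_glucose))

The ε parameter trades privacy for utility: smaller values add more noise and, under a utility function like the one above, typically recover fewer of the original rules.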
