Data Mining and Privacy

With the emergence of Internet, it is now possible to connect and access sources of information and databases throughout the world. At the same time, this raises many questions regarding the privacy and the security of the data, in particular how to mine useful information while preserving the privacy of sensible and confidential data. Privacy-preserving data mining is a relatively new but rapidly growing field that studies how data mining algorithms affect the privacy of data and tries to find and analyze new algorithms that preserve this privacy. At first glance, it may seem that data mining and privacy have orthogonal goals, the first one being concerned with the discovery of useful knowledge from data whereas the second is concerned with the protection of data’s privacy. Historically, the interactions between privacy and data mining have been questioned and studied since more than a decade ago, but the name of the domain itself was coined more recently by two seminal papers attacking the subject from two very different perspectives (Agrawal & Srikant, 2000; Lindell & Pinkas, 2000). The first paper (Agrawal & Srikant, 2000) takes the approach of randomizing the data through the injection of noise, and then recovers from it by applying a reconstruction algorithm before a learning task (the induction of a decision tree) is carried out on the reconstructed dataset. The second paper (Lindell & Pinkas, 2000) adopts a cryptographic view of the problem and rephrases it within the general framework of secure multiparty computation. The outline of this chapter is the following. First, the area of privacy-preserving data mining is illustrated through three scenarios, before a classification of privacy-preserving algorithms is described and the three main approaches currently used are detailed. Finally, the future trends and challenges that await the domain are discussed before concluding.

[1]  Charu C. Aggarwal,et al.  On the design and quantification of privacy preserving data mining algorithms , 2001, PODS.

[2]  Udo Winand,et al.  The VLEG based production and maintenance process for web-based learning applications , 2002 .

[3]  John Wang,et al.  Data Warehousing and Mining: Concepts, Methodologies, Tools, and Applications , 2008 .

[4]  Yehuda Lindell,et al.  Privacy Preserving Data Mining , 2002, Journal of Cryptology.

[5]  Stephen E. Fienberg,et al.  Data Swapping: Variations on a Theme by Dalenius and Reiss , 2004, Privacy in Statistical Databases.

[6]  Philip Calvert,et al.  Encyclopedia of Data Warehousing and Mining , 2006 .

[7]  Latanya Sweeney,et al.  Achieving k-Anonymity Privacy Protection Using Generalization and Suppression , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[8]  David Chaum,et al.  Multiparty unconditionally secure protocols , 1988, STOC '88.

[9]  Adam Meyerson,et al.  On the complexity of optimal K-anonymity , 2004, PODS.

[10]  Elisa Bertino,et al.  A Framework for Evaluating Privacy Preserving Data Mining Algorithms* , 2005, Data Mining and Knowledge Discovery.

[11]  Chi-Jen Lu,et al.  Oblivious polynomial evaluation and oblivious neural learning , 2001, Theor. Comput. Sci..

[12]  Balázs Kégl,et al.  Privacy-preserving boosting , 2007, Data Mining and Knowledge Discovery.

[13]  Taneli Mielikäinen,et al.  Cryptographically private support vector machines , 2006, KDD '06.

[14]  Kevin M. Curtin,et al.  Integrating GIS and Maximal Covering Models to Determine Optimal Police Patrol Areas , 2005 .

[15]  Joan Feigenbaum,et al.  Secure Multiparty Computation of Approximations , 2001, ICALP.

[16]  Alexandre V. Evfimievski,et al.  Limiting privacy breaches in privacy preserving data mining , 2003, PODS.

[17]  Elisa Bertino,et al.  State-of-the-art in privacy preserving data mining , 2004, SGMD.