Output privacy in data mining

Privacy has been identified as a vital requirement in designing and implementing data mining systems. In general, privacy preservation demands protecting both input and output privacy: the former refers to sanitizing the raw data before mining is performed, while the latter refers to protecting the mining output (models or patterns) against malicious inference attacks. This article presents a systematic study of the problem of protecting output privacy in data mining, and in particular stream mining: (i) we highlight the importance of this problem by showing that even sufficient protection of input privacy does not guarantee output privacy; (ii) we present a general inference and disclosure model that exploits intra-window and inter-window privacy breaches in stream mining output; (iii) we propose a lightweight countermeasure that effectively eliminates these breaches without explicitly detecting them, while minimizing the loss of output accuracy; (iv) we further optimize the basic scheme by taking into account two types of semantic constraints, aiming to maximally preserve utility-related semantics while maintaining a hard privacy guarantee; (v) finally, we conduct extensive experimental evaluation over both synthetic and real data to validate the efficacy of our approach.
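To give a concrete flavor of output perturbation in this setting, the following minimal Python sketch illustrates one way a stream miner might randomize the support counts it publishes for each sliding window. This is not the paper's actual scheme; `perturb_support`, `epsilon`, and `floor_ratio` are illustrative assumptions. The idea is to keep each published count within a small relative-error bound (preserving output utility) while preventing an adversary from recovering exact counts within or across windows.

```python
import random

def perturb_support(true_count, window_size, epsilon=0.05, floor_ratio=0.01):
    """Return a randomized support count for one mined pattern.

    Illustrative sketch only: adds bounded multiplicative noise so the
    published count cannot be inverted back to the exact value, while
    keeping the relative error within +/- epsilon of the true support.
    """
    noise = random.uniform(-epsilon, epsilon)        # bounded random distortion
    noisy = int(round(true_count * (1.0 + noise)))   # stays close to the true count
    # Do not publish counts below a small floor, so that rare (and hence
    # potentially identifying) patterns are not exposed exactly.
    return max(noisy, int(window_size * floor_ratio))

# Example: publish perturbed supports for patterns mined from one window.
window_size = 10_000
mined = {("milk", "bread"): 412, ("beer", "diapers"): 37}
published = {p: perturb_support(c, window_size) for p, c in mined.items()}
print(published)
```

A practical scheme would also have to correlate the noise across overlapping windows, since independently perturbed counts of the same pattern in consecutive windows can otherwise be averaged to cancel the distortion; that inter-window aspect is exactly what the attack model above targets.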
