Attribute Segregation based on Feature Ranking Framework for Privacy Preserving Data Mining

Attributes in macro-data have to be segregated based on their sensitivity for privacy preservation purposes. Automating this attribute segregation becomes complicated in high-dimensional datasets and data streams. In this work, the information or correlation that each attribute carries about the target class attribute is measured using Information Gain [IG], Gain Ratio [GR] and Pearson Correlation [PC] ranker-based feature selection methods, and these values are used to segregate the attributes into Sensitive Attributes [SA], Quasi Identifiers [QI] and Non-Sensitive [NS] Attributes. The segregated attributes are subjected to various levels of privacy preservation using both the proposed Double Layer Perturbation [DLP] and Single Layer Perturbation [SLP] algorithms to form the level-1 perturbed datasets. The level-1 perturbed dataset is further perturbed by applying the SLP algorithm to form level-2 and level-3 privacy-preserved datasets. The resulting versions of the Adult dataset are distributed to data seekers according to their trust levels in a Multi Trust Level [MTL] environment. The privacy-preserved dataset versions created using the proposed algorithms are evaluated on their utility, distortion and purity metrics. The results show that the ranker methods automatically identify attributes with sensitive content as either SA or QI, and that the proposed perturbed datasets have good utility on selected classification and clustering algorithms when compared to the original and L-diversified datasets. In addition, the distortion values of these datasets indicate that they can prevent diversity attacks.
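The ranking step can be illustrated with a minimal sketch, given here only as one possible reading of the abstract. It computes Information Gain and Gain Ratio of a categorical attribute against the class attribute and then buckets attributes by score. The function names, the two thresholds, and the rule that higher scores map to SA are assumptions made for illustration; the abstract does not state the cut-offs or the direction of the mapping, and the Pearson Correlation ranker and the DLP and SLP perturbation steps are not shown.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_gain(feature, target):
    """IG = H(target) - H(target | feature) for categorical data."""
    n = len(target)
    groups = {}
    for f, t in zip(feature, target):
        groups.setdefault(f, []).append(t)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - conditional


def gain_ratio(feature, target):
    """GR = IG / split information, where split information is H(feature)."""
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a constant attribute carries no information
    return information_gain(feature, target) / split_info


def segregate(scores, sa_threshold, qi_threshold):
    """Bucket attributes by score. Thresholds and direction are hypothetical."""
    labels = {}
    for name, score in scores.items():
        if score >= sa_threshold:
            labels[name] = "SA"
        elif score >= qi_threshold:
            labels[name] = "QI"
        else:
            labels[name] = "NS"
    return labels


# Toy data, not taken from the Adult dataset.
target = [">50K", "<=50K", ">50K", "<=50K"]
attributes = {
    "attr_a": ["a", "b", "a", "b"],  # fully determines the class
    "attr_b": ["x", "x", "y", "y"],  # independent of the class
}
scores = {name: information_gain(col, target) for name, col in attributes.items()}
labels = segregate(scores, sa_threshold=0.5, qi_threshold=0.1)
```

In this toy case `attr_a` receives the highest possible score for a binary class and `attr_b` receives zero, so they land in different buckets. Swapping `information_gain` for `gain_ratio` in the scoring line gives the second ranker.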
