Evaluation of a perturbation-based technique for privacy preservation in a multi-party clustering scenario

Advances in data processing and the growth of the internet have resulted in a data explosion. The data now available may contain sensitive information that, if misused, could jeopardise the privacy of individuals. The privacy of personal and business information is therefore a growing concern for individuals, corporate entities and governments. Protecting personal and sensitive information is critical to the success of today's data mining techniques, and even more so in critical sectors such as defence, health care and finance. Privacy Preserving Data Mining (PPDM) addresses these issues by balancing the preservation of privacy against the utilisation of data. Traditionally, Geometrical Data Transformation Methods (GDTMs) have been widely used for privacy-preserving clustering. Their drawback is that geometric transformation functions are invertible, so the original data can in principle be recovered, which results in a lower level of privacy protection. This work proposes a Principal Component Analysis (PCA)-based technique that preserves the privacy of sensitive information in a multi-party clustering scenario. The technique is evaluated by applying the classical K-means clustering algorithm and a machine learning-based clustering method to synthetic and real-world datasets, with clustering accuracy computed before and after the privacy-preserving transformation. The proposed PCA-based transformation method provided stronger privacy protection and better performance than the traditional GDTMs.
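The invertibility concern raised for GDTMs can be illustrated with a minimal sketch. It assumes a plain 2-D rotation as the geometric transformation and uses the Rand index as one possible measure of agreement between cluster labelings; the abstract does not name the measure, so that choice is an assumption. The records, the angle, and the helper names `rotate` and `rand_index` are hypothetical and are not taken from the evaluated method.

```python
import math
from itertools import combinations


def rotate(points, theta):
    """Rotate 2-D points counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree.

    A pair counts as an agreement when both labelings place the two
    points in the same cluster, or both place them in different clusters.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical toy records with two attributes each.
original = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5)]
theta = math.radians(40)  # hypothetical secret rotation angle

released = rotate(original, theta)     # what a party would share
recovered = rotate(released, -theta)   # inverse transformation

# Largest coordinate-wise reconstruction error (floating-point noise only).
max_error = max(
    abs(a - b) for p, q in zip(original, recovered) for a, b in zip(p, q)
)

# Rotation preserves pairwise distances, so distance-based clustering
# sees the same geometry before and after the transformation.
dist_before = [math.dist(p, q) for p, q in combinations(original, 2)]
dist_after = [math.dist(p, q) for p, q in combinations(released, 2)]

# Rand index ignores label names: a relabelled but identical partition
# scores 1.0, while a partition that moves one point scores lower.
same_partition = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
changed_partition = rand_index([0, 0, 1, 1], [0, 1, 1, 1])
```

In this sketch, anyone who learns or estimates the angle recovers the records up to rounding error, which is the weakness attributed to invertible transformations. The distance preservation shown here is also why such transformations leave clustering results unchanged.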
