Privacy Preserving k-Means Clustering in Multi-Party Environment

Extracting meaningful and valuable knowledge from databases is often done with data mining algorithms. Today, databases are frequently distributed among two or more parties for reasons such as physical and geographical restrictions and, most importantly, privacy. Related data is normally maintained by more than one organization, each of which wants to keep its own information private. Privacy-preserving techniques and protocols are therefore designed to perform data mining in distributed environments where privacy is a major concern. Cluster analysis is a data mining technique that divides data into meaningful clusters, and it plays an important role in fields such as bioinformatics, marketing, machine learning, climate and medicine. k-means clustering is a prominent algorithm in this category that creates a one-level clustering of the data. In this paper we introduce privacy-preserving protocols for this algorithm, together with a protocol for secure comparison, known as the Millionaires' Problem, used as a sub-protocol, to handle the clustering of horizontally or vertically partitioned data among two or more parties.
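To make the horizontally partitioned setting concrete, the following is a minimal sketch of one k-means centroid update in which each party only contributes per-cluster sums and counts of its own rows. It is not the protocol introduced in the paper: the function names (`nearest`, `local_stats`, `update_centroids`) and the toy data are assumptions made for illustration, and the aggregates are combined in the clear, whereas a privacy-preserving protocol would combine them with secure sub-protocols.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: List[Point]) -> int:
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centroid))
             for centroid in centroids]
    return dists.index(min(dists))


def local_stats(points: List[Point], centroids: List[Point]):
    """Per-cluster coordinate sums and counts for one party's own rows."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in points:
        j = nearest(point, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    return sums, counts


def update_centroids(all_stats, centroids: List[Point]) -> List[Point]:
    """Combine every party's aggregates into new centroids.

    Illustrative assumption: aggregates are added in the clear here.
    """
    k, dim = len(centroids), len(centroids[0])
    new_centroids = []
    for j in range(k):
        total = sum(counts[j] for _, counts in all_stats)
        if total == 0:
            # Empty cluster: keep the previous centroid.
            new_centroids.append(centroids[j])
        else:
            new_centroids.append(tuple(
                sum(sums[j][d] for sums, _ in all_stats) / total
                for d in range(dim)
            ))
    return new_centroids


# Hypothetical toy data: two parties holding different rows, same attributes.
party_a = [(1.0, 1.0), (9.0, 9.0)]
party_b = [(2.0, 2.0), (10.0, 10.0)]
centroids = [(0.0, 0.0), (8.0, 8.0)]

stats = [local_stats(party_a, centroids), local_stats(party_b, centroids)]
new_centroids = update_centroids(stats, centroids)
```

The sketch only shows why a centroid update over horizontally partitioned data can be expressed through sums and counts rather than raw records; it offers no privacy guarantee by itself.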
