Privacy-preserving distributed k-means clustering over arbitrarily partitioned data

Advances in computer networking and database technologies have enabled the collection and storage of vast quantities of data. Data mining can extract valuable knowledge from this data, and organizations have realized that they can often obtain better results by pooling their data. However, the collected data may contain sensitive or private information about the organizations or their customers, and privacy concerns are exacerbated when data is shared between multiple organizations.

Distributed data mining is concerned with computing models from data that is distributed among multiple participants. Privacy-preserving distributed data mining seeks to allow the cooperative computation of such models without the cooperating parties revealing any of their individual data items. Our paper makes two contributions to privacy-preserving data mining. First, we introduce the concept of arbitrarily partitioned data, which is a generalization of both horizontally and vertically partitioned data. Second, we provide an efficient privacy-preserving protocol for k-means clustering in the setting of arbitrarily partitioned data.
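To make the notion of arbitrarily partitioned data concrete, the following is a minimal sketch for two parties. It is only an illustration of the data layout, not the privacy-preserving protocol itself: it performs no cryptography and inspects both parties' views in one place, which a real protocol would never do. The names `is_valid_partition`, `partition_kind`, and the `None`-for-missing convention are assumptions made for this example and do not come from the paper.

```python
from typing import List, Optional

# Hypothetical representation: each party holds a table of the same shape,
# with None in every cell that the other party owns.
Table = List[List[Optional[float]]]


def is_valid_partition(alice: Table, bob: Table) -> bool:
    """Return True if the tables have equal shape and every cell is held by exactly one party."""
    if len(alice) != len(bob):
        return False
    for row_a, row_b in zip(alice, bob):
        if len(row_a) != len(row_b):
            return False
        for a, b in zip(row_a, row_b):
            # Exactly one of the two values must be present.
            if (a is None) == (b is None):
                return False
    return True


def partition_kind(alice: Table, bob: Table) -> str:
    """Classify a valid two-party partition as horizontal, vertical, or arbitrary."""
    if not is_valid_partition(alice, bob):
        raise ValueError("each cell must be owned by exactly one party")
    n_cols = len(alice[0]) if alice else 0
    # Horizontal: every record (row) belongs entirely to one party.
    horizontal = all(
        all(v is not None for v in row) or all(v is None for v in row)
        for row in alice
    )
    # Vertical: every attribute (column) belongs entirely to one party.
    vertical = all(
        all(row[j] is not None for row in alice)
        or all(row[j] is None for row in alice)
        for j in range(n_cols)
    )
    if horizontal:
        return "horizontal"
    if vertical:
        return "vertical"
    return "arbitrary"


# Illustrative data: two records with two attributes each.
horizontal_example = ([[1.0, 2.0], [None, None]], [[None, None], [3.0, 4.0]])
vertical_example = ([[1.0, None], [3.0, None]], [[None, 2.0], [None, 4.0]])
arbitrary_example = ([[1.0, None], [None, 4.0]], [[None, 2.0], [3.0, None]])

for name, (a, b) in [
    ("horizontal", horizontal_example),
    ("vertical", vertical_example),
    ("arbitrary", arbitrary_example),
]:
    print(name, "->", partition_kind(a, b))
```

In this sketch, horizontal and vertical partitioning appear as the two special cases in which ownership follows whole rows or whole columns; any other assignment of cells to parties is classified as arbitrary.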
