Utility-efficient Differentially Private K-means Clustering based on Cluster Merging

Differential privacy is widely used in data analysis. State-of-the-art differentially private $k$-means clustering algorithms typically add an equal amount of noise to the centroids in every iteration. In this paper, we propose a novel differentially private $k$-means clustering algorithm, DP-KCCM, that significantly improves clustering utility by adding adaptive noise and merging clusters. Specifically, to obtain $k$ clusters with differential privacy, the algorithm first generates $n \times k$ initial centroids, adds adaptive noise in each iteration to obtain $n \times k$ clusters, and finally merges these clusters into $k$ clusters. We theoretically prove that the proposed algorithm satisfies differential privacy. Surprisingly, extensive experimental results show that: 1) cluster merging with equal amounts of noise improves utility somewhat; 2) although adaptive noise alone does not improve utility, combining cluster merging with adaptive noise improves utility significantly further.
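The three stages described above (over-provisioned initialization, noisy iterations with a varying budget, and merging) can be illustrated with a minimal sketch. It is not the DP-KCCM algorithm itself, since the abstract does not specify the noise schedule or the merging rule. The sketch assumes data scaled to $[0, 1]^d$, the Laplace mechanism applied to per-cluster sums and counts under add/remove-one-record neighboring datasets, a geometrically growing per-iteration budget, and a greedy count-weighted merge of the closest centroid pair. All function names and the `growth` parameter are hypothetical.

```python
import numpy as np

def budget_schedule(eps, iters, growth):
    # Assumed schedule: per-iteration budgets grow geometrically and sum to eps.
    weights = growth ** np.arange(iters)
    return eps * weights / weights.sum()

def noisy_lloyd_step(X, centroids, eps_t, rng):
    # One Lloyd iteration with Laplace noise; assumes X lies in [0, 1]^d.
    d = X.shape[1]
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # One record changes one count by 1 and one sum by at most d (L1 norm),
    # so the joint L1 sensitivity of all sums and counts is d + 1.
    scale = (d + 1) / eps_t
    new_centroids = np.empty_like(centroids)
    counts = np.empty(len(centroids))
    for j in range(len(centroids)):
        members = X[labels == j]
        noisy_count = len(members) + rng.laplace(0.0, scale)
        noisy_sum = members.sum(axis=0) + rng.laplace(0.0, scale, size=d)
        counts[j] = max(noisy_count, 1.0)  # avoid division by tiny or negative counts
        new_centroids[j] = np.clip(noisy_sum / counts[j], 0.0, 1.0)
    return new_centroids, counts

def merge_clusters(centroids, counts, k):
    # Assumed merging rule: repeatedly merge the two closest centroids,
    # weighting by noisy counts. This is post-processing of private outputs,
    # so it consumes no additional privacy budget.
    cents = [c.copy() for c in centroids]
    wts = list(counts)
    while len(cents) > k:
        best = None
        for a in range(len(cents)):
            for b in range(a + 1, len(cents)):
                dist = float(np.sum((cents[a] - cents[b]) ** 2))
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        total = wts[a] + wts[b]
        cents[a] = (wts[a] * cents[a] + wts[b] * cents[b]) / total
        wts[a] = total
        del cents[b]
        del wts[b]
    return np.array(cents)

def dp_kmeans_with_merging(X, k, n, eps, iters=5, growth=1.5, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Data-independent initialization of n * k centroids (no privacy cost).
    centroids = rng.uniform(0.0, 1.0, size=(n * k, d))
    for eps_t in budget_schedule(eps, iters, growth):
        centroids, counts = noisy_lloyd_step(X, centroids, eps_t, rng)
    return merge_clusters(centroids, counts, k)
```

Calling `dp_kmeans_with_merging(X, k=3, n=2, eps=1.0)` on an array `X` of shape `(num_points, d)` returns an array of `k` merged centroids. By sequential composition, the per-iteration budgets add up to the total `eps`; the growing schedule is only one possible way to make the noise adaptive.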
