Paper Information - A Parallel Cop-Kmeans Clustering Algorithm Based on MapReduce Framework

Clustering with background information has become highly desirable in many business applications because it can capture important semantics of the business or dataset. Must-Link and Cannot-Link constraints between pairs of instances in the dataset are a common form of prior knowledge incorporated into many clustering algorithms. Cop-Kmeans incorporates these constraints into its clustering mechanism. However, as the scale of data grows rapidly, it becomes increasingly difficult for Cop-Kmeans to handle massive datasets. In this paper, we propose a parallel Cop-Kmeans algorithm based on MapReduce, a technique that distributes the clustering load over a given number of processors. Experimental results show that this approach scales well to massive datasets while maintaining all crucial characteristics of the serial Cop-Kmeans algorithm.
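The idea in the abstract can be illustrated with a minimal sketch of one constrained k-means iteration split into a map step and a reduce step. This is not the paper's implementation: the function names (`violates`, `map_assign`, `reduce_centroids`, `one_iteration`) are hypothetical, the chunks are processed in a plain loop rather than on a real MapReduce cluster, and the sketch assumes that both points of every constrained pair live in the same chunk, so each mapper can check constraints locally.

```python
from collections import defaultdict
from math import dist


def violates(i, cluster, assigned, must_link, cannot_link):
    """True if placing point i in `cluster` breaks a constraint
    with a point that has already been assigned."""
    for a, b in must_link:
        if i in (a, b):
            other = b if a == i else a
            if other in assigned and assigned[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if a == i else a
            if assigned.get(other) == cluster:
                return True
    return False


def map_assign(chunk, centroids, must_link, cannot_link):
    """Map step (hypothetical): emit (cluster, point) for each point in one chunk."""
    assigned, emitted = {}, []
    for i, point in chunk:
        # Try centroids from nearest to farthest and take the first feasible one.
        order = sorted(range(len(centroids)),
                       key=lambda c: dist(point, centroids[c]))
        for c in order:
            if not violates(i, c, assigned, must_link, cannot_link):
                assigned[i] = c
                emitted.append((c, point))
                break
        else:
            raise ValueError(f"no feasible cluster for point {i}")
    return emitted


def reduce_centroids(emitted, old_centroids):
    """Reduce step (hypothetical): average the points emitted for each cluster."""
    groups = defaultdict(list)
    for c, point in emitted:
        groups[c].append(point)
    new = list(old_centroids)  # clusters that received no points keep their centroid
    for c, pts in groups.items():
        new[c] = tuple(sum(coord) / len(pts) for coord in zip(*pts))
    return new


def one_iteration(chunks, centroids, must_link, cannot_link):
    """Run every mapper, then a single reducer, and return the new centroids."""
    emitted = []
    for chunk in chunks:  # each call could run on a separate worker
        emitted.extend(map_assign(chunk, centroids, must_link, cannot_link))
    return reduce_centroids(emitted, centroids)


chunks = [
    [(0, (0.0, 0.0)), (1, (1.0, 0.0))],
    [(2, (10.0, 0.0)), (3, (9.0, 0.0))],
]
centroids = [(0.0, 0.0), (10.0, 0.0)]
constrained = one_iteration(chunks, centroids, [], [(0, 1)])
unconstrained = one_iteration(chunks, centroids, [], [])
```

With the Cannot-Link pair `(0, 1)`, point 1 is pushed away from its nearest centroid into the second cluster, which shifts that cluster's mean. Without constraints the step reduces to ordinary k-means.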

[1] Claire Cardie, et al. Clustering with Instance-Level Constraints, 2000, AAAI/IAAI.

[2] Malay K. Pakhira. Clustering Large Databases in Distributed Environment, 2009, 2009 IEEE International Advance Computing Conference.

[3] Brian Hayes, et al. What Is Cloud Computing?, 2019, Cloud Technologies.

[4] S. S. Ravi, et al. Identifying and Generating Easy Sets of Constraints for Clustering, 2006, AAAI.

[5] Claire Cardie, et al. Constrained K-means Clustering with Background Knowledge, 2001, Proceedings of the Eighteenth International Conference on Machine Learning, pp. 577–584.

[6] Qing He, et al. Parallel K-Means Clustering Based on MapReduce, 2009, CloudCom.

[7] Tianrui Li, et al. An Improved Cop-Kmeans Algorithm for Solving Constraint Violation, 2010.

[8] Claire Cardie, et al. Intelligent Clustering with Instance-Level Constraints, 2002.

[9] Jiali Mao, et al. The Study of Parallel K-Means Algorithm, 2006, 2006 6th World Congress on Intelligent Control and Automation.