Fast K-Means Clustering for Very Large Datasets Based on MapReduce Combined with a New Cutting Method

Clustering very large datasets is a challenging problem in data mining and processing. MapReduce is a powerful programming framework that significantly reduces execution time by dividing a job into several tasks and executing them in a distributed environment. K-Means is one of the most widely used clustering methods, and K-Means based on MapReduce is considered an advanced solution for clustering very large datasets. However, execution time remains an obstacle because the number of iterations grows as the dataset size and the number of clusters increase. This paper presents a new approach for reducing the number of iterations of the K-Means algorithm that can be applied to very large dataset clustering. In tests on several very large datasets with real-valued attributes, the new method reduces the number of iterations by up to 30 percent while maintaining up to 98 percent accuracy. Based on these experimental results, this paper proposes a new fast K-Means clustering method for very large datasets based on MapReduce combined with a new cutting method (abbreviated to FMR.K-Means).
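To make the iteration cost concrete, the following is a minimal single-process sketch of standard K-Means expressed as a map phase (assign each point to its nearest centroid) and a reduce phase (average the points per centroid), repeated until the centroids stop moving. It illustrates only the baseline whose iteration count the paper aims to reduce; it does not implement the cutting method, which the abstract does not describe. The function names (`map_assign`, `reduce_update`, `kmeans_mapreduce`), the tolerance `tol`, and the in-memory `splits` are assumptions made for illustration, not part of FMR.K-Means or of any MapReduce framework API.

```python
from collections import defaultdict
from math import dist


def map_assign(points, centroids):
    """Map phase (hypothetical helper): emit (nearest centroid index, point)."""
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield idx, p


def reduce_update(pairs, centroids):
    """Reduce phase (hypothetical helper): average the points of each key."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # a centroid with no points is kept as-is
    for idx, pts in groups.items():
        new_centroids[idx] = tuple(sum(coord) / len(pts) for coord in zip(*pts))
    return new_centroids


def kmeans_mapreduce(splits, centroids, tol=1e-9, max_iter=100):
    """Run map/reduce rounds until no centroid moves more than tol.

    Returns the final centroids and the number of iterations executed.
    In a real deployment each split would be processed on a different node;
    here the splits are simply processed one after another.
    """
    for iteration in range(1, max_iter + 1):
        pairs = [kv for split in splits for kv in map_assign(split, centroids)]
        new_centroids = reduce_update(pairs, centroids)
        shift = max(dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, iteration


if __name__ == "__main__":
    # Two toy splits standing in for data blocks on two worker nodes.
    splits = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
    final, n_iter = kmeans_mapreduce(splits, [(0.0, 0.0), (10.0, 10.0)])
    print(final, n_iter)
```

Each pass through the loop corresponds to one full MapReduce job over the whole dataset, which is why lowering the number of iterations translates directly into lower execution time.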
