Parallel batch k-means for Big Data clustering

Abstract. Clustering algorithms are being applied ever more widely as data volumes grow rapidly. Existing algorithms, however, are not always effective because of their high computational complexity. This paper proposes a new parallel batch clustering algorithm based on k-means. The algorithm splits a dataset into equal partitions, which reduces the exponential growth of computations. The goal is to preserve the characteristics of the dataset while increasing clustering speed. Cluster centers are calculated for each partition; these centers are then merged and clustered in turn. An approach to determining the optimal batch size is also considered, and the statistical significance of the proposed approach is assessed. Six experimental datasets are used to evaluate the effectiveness of the proposed parallel batch clustering, and the results are compared with those of the k-means algorithm. The analysis shows that the proposed algorithm is practically applicable to Big Data.
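The partition, cluster, merge, and re-cluster pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`kmeans`, `batch_kmeans`), the random initialization, the fixed iteration count, and the use of a thread pool are all assumptions made for illustration, and the batch count is passed in by hand rather than derived from the paper's batch-size procedure.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (assumed baseline); returns a list of k centers.

    Requires len(points) >= k. Centers start as k randomly sampled points.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: group every point with its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group;
        # an empty group keeps its previous center.
        centers = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def batch_kmeans(points, k, n_batches, workers=4):
    """Hypothetical sketch: cluster each partition, then cluster the merged centers."""
    size = -(-len(points) // n_batches)  # ceiling division: near-equal partitions
    batches = [points[i:i + size] for i in range(0, len(points), size)]
    # Each partition is clustered independently, so the work can be farmed out.
    # Threads keep the sketch simple; CPU-bound work in CPython would need a
    # process pool to see a real speedup.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_centers = list(pool.map(lambda b: kmeans(b, k), batches))
    # Merge the per-partition centers and cluster them once more.
    merged = [c for centers in local_centers for c in centers]
    return kmeans(merged, k)
```

Calling `batch_kmeans(points, k=2, n_batches=4)` on a list of coordinate tuples returns two final centers. The final step clusters only `k * n_batches` merged centers rather than the full dataset, which is where the saving over a single k-means run would come from; every partition must contain at least `k` points for the sketch to work.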
