A Data Science and Engineering Solution for Fast K-Means Clustering of Big Data

With advances in technology, high volumes of a wide variety of valuable data of differing veracity can easily be collected or generated at high velocity in the current era of big data. Embedded in these big data is implicit, previously unknown, and potentially useful information. Hence, fast and scalable big data science and engineering solutions that mine and discover knowledge from these data are in demand. A popular and practical data mining task is clustering, that is, grouping similar data into clusters. To cluster very large or big data, k-means based algorithms have been widely used. Although many existing k-means algorithms give quality results, they also suffer from some problems. For instance, randomly selecting the initial k centroids is risky (a poor choice can lead to a poor clustering), the algorithms tend to produce roughly equal-sized circular clusters, and their runtime complexity is very high. To deal with these problems, we present in this paper a big data science and engineering solution that applies a heuristic prototype-based algorithm. Evaluation results show the efficiency and scalability of this solution.
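To make the first of these problems concrete, the following is a minimal sketch of the standard Lloyd-style k-means iteration. It is not the heuristic prototype-based algorithm proposed in this paper. The function names (`squared_distance`, `lloyd_kmeans`), the four-point toy dataset, and the use of the sum of squared errors (SSE) as the quality measure are all assumptions introduced for illustration. The sketch shows how two different choices of initial centroids on the same data can converge to clusterings of very different quality.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, initial_centroids, max_iter=100):
    """Plain Lloyd iteration (illustrative only, not the paper's method).

    Returns the final centroids and the sum of squared errors (SSE).
    """
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    sse = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, sse


# Hypothetical toy data: two natural groups, near x = 0 and near x = 10.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]

# Both initial centroids fall in the same natural group: poor local optimum.
_, sse_bad = lloyd_kmeans(points, [(0, 0), (0, 1)])
# One initial centroid per natural group: the intended clustering.
_, sse_good = lloyd_kmeans(points, [(0, 0), (10, 0)])

print(sse_bad, sse_good)  # 100.0 1.0
```

In this toy case the unlucky initialization converges to a solution whose SSE is 100 times worse. A common mitigation is to rerun the algorithm from several random initializations and keep the best result, at the price of additional runtime.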
