K-Means Parallel Acceleration for Sparse Data Dimensions on Flink

The K-means algorithm is a clustering algorithm widely used in many applications, and its running time grows dramatically as the data size expands. When the volume of data exceeds what a single machine can handle, the algorithm must be parallelized on a distributed computing framework. During parallel execution, however, data skew causes the running times of individual tasks to differ, and the progress of the entire job is determined by the task with the longest running time. In this paper, we propose an optimized data partitioning method for applying the k-means algorithm to datasets with sparse dimensions, eliminating the data skew problem and further accelerating the parallel execution of the algorithm. Experimental evaluation on large-scale text datasets demonstrates the effectiveness of our partitioning approach on Flink.
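The straggler effect described above can be illustrated with a small sketch: if sparse rows are partitioned by row count alone, one worker may receive far more nonzero entries than another, and the job waits for the heaviest worker. The snippet below is a minimal, hypothetical illustration (not the paper's actual partitioning method) of balancing load by nonzero count with a greedy longest-processing-time heuristic; the function names and the choice of Python are assumptions for demonstration only.

```python
import heapq

def greedy_partition(nnz_counts, num_workers):
    """Assign each sparse row (given by its nonzero count) to the
    currently lightest-loaded worker: the classic LPT heuristic.
    Returns the per-row worker assignment and per-worker loads."""
    # Min-heap of (current load, worker id) so the lightest worker pops first.
    heap = [(0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = [None] * len(nnz_counts)
    # Place heavy rows first; this bounds the final imbalance.
    for i in sorted(range(len(nnz_counts)), key=lambda i: -nnz_counts[i]):
        load, w = heapq.heappop(heap)
        assignment[i] = w
        heapq.heappush(heap, (load + nnz_counts[i], w))
    loads = [0] * num_workers
    for i, w in enumerate(assignment):
        loads[w] += nnz_counts[i]
    return assignment, loads

# A naive contiguous split of rows [10, 10, 10, 1, 1, 1] over 2 workers
# gives loads (30, 3); the greedy heuristic gives (20, 13), so the
# slowest task, which gates the whole job, finishes much earlier.
```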
