Paper Information - Parallelizing K-Means Algorithm for 1-D Data Using MPI

Computational systems and electronic instruments such as telescopes and medical devices now produce colossal amounts of information. To explore these petabytes of data, new fast algorithms must be discovered or existing ones redesigned. Clustering is one of the most popular and useful techniques for discovering and extracting information from data pools, and k-means is an algorithm that clusters data according to their characteristics. Its main disadvantage is its computational complexity, which makes the technique very difficult to apply to big data sets. Although k-means is a very well-studied technique, a fully parallel version of it has not yet been explored. In this work, a parallel version of k-means is presented for 1-D objects. The experimental results are in line with the theoretical outcome and demonstrate both the correctness and the effectiveness of the technique.

Ilias K. Savvas | Georgia N. Sofianidou
