Accelerate K-means Algorithm by Using GPU in the Hadoop Framework

Cluster analysis, such as the k-means algorithm, plays a critical role in data mining, but it now faces a computational challenge as data volumes continue to grow. Parallel computing is an efficient way to overcome this difficulty. In this paper, we use Graphics Processing Units (GPUs) within the Hadoop framework to accelerate the k-means algorithm. As a result, our algorithm is about 10 times faster than the k-means implementation provided by Mahout.
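
To make the structure of the computation concrete, the following is a minimal CPU-only sketch of one k-means iteration split into a map step (assign each point to its nearest centroid) and a reduce step (average the points in each cluster). It is not the paper's implementation: the function names `nearest`, `map_phase`, `reduce_phase`, and `kmeans_iteration` are hypothetical, and the sketch uses neither Hadoop nor a GPU. It only illustrates that the per-point distance computation in the map step is independent across points, which is the kind of work a GPU could plausibly take over.

```python
def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for i, c in enumerate(centroids):
        d = sum((p - q) ** 2 for p, q in zip(point, c))
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index


def map_phase(points, centroids):
    """Emit (cluster index, (point, 1)) pairs; each point is processed independently."""
    for point in points:
        yield nearest(point, centroids), (point, 1)


def reduce_phase(pairs, centroids):
    """Sum the points per cluster and divide by the count to get new centroids."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for idx, (point, n) in pairs:
        counts[idx] += n
        for j, x in enumerate(point):
            sums[idx][j] += x
    new_centroids = []
    for idx, old in enumerate(centroids):
        if counts[idx] == 0:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(list(old))
        else:
            new_centroids.append([s / counts[idx] for s in sums[idx]])
    return new_centroids


def kmeans_iteration(points, centroids):
    """Run one assignment-and-update step of k-means."""
    return reduce_phase(map_phase(points, centroids), centroids)
```

Calling `kmeans_iteration` repeatedly until the centroids stop changing gives the standard Lloyd-style k-means loop; in a MapReduce setting, each iteration would typically correspond to one job.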
