Performance Improvement of Data Mining in Weka through GPU Acceleration

Data mining tools can be computationally demanding, so there is increasing interest in parallel computing strategies to improve their performance. The popularization of Graphics Processing Units (GPUs) has increased the computing power of current desktop computers, but desktop-based data mining tools do not usually take full advantage of these architectures. This paper explores an approach to improve the performance of Weka, a popular data mining tool, through parallelization on GPU-accelerated machines. Based on profiling of Weka's object-oriented code, we chose to parallelize a matrix multiplication method using state-of-the-art tools. The implementation was merged into Weka so that we could analyze the impact of parallel execution on its performance. The results show a significant speedup on the target parallel architectures compared to the original, sequential Weka code.
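The abstract does not say which Weka method was parallelized or which GPU toolchain was used, so the following is only a minimal sketch of the general idea: each row of a matrix product can be computed independently, which is the property a GPU kernel exploits. The class and method names (`MatMulSketch`, `multiplySequential`, `multiplyParallel`) are hypothetical, and Java parallel streams on CPU threads stand in for the GPU here; this is not the authors' implementation.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical illustration; not Weka code and not the paper's GPU implementation.
public class MatMulSketch {

    // Sequential baseline: the classic triple loop over rows, columns, and the inner dimension.
    static double[][] multiplySequential(double[][] a, double[][] b) {
        int n = a.length, k = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                double sum = 0.0;
                for (int p = 0; p < k; p++) {
                    sum += a[i][p] * b[p][j];
                }
                c[i][j] = sum;
            }
        }
        return c;
    }

    // Data-parallel variant: every output row depends only on one row of `a`
    // and all of `b`, so rows can be computed concurrently without locking.
    // CPU threads are used here as a stand-in for GPU threads (assumption).
    static double[][] multiplyParallel(double[][] a, double[][] b) {
        int n = a.length, k = b.length, m = b[0].length;
        double[][] c = new double[n][m];
        IntStream.range(0, n).parallel().forEach(i -> {
            for (int j = 0; j < m; j++) {
                double sum = 0.0;
                for (int p = 0; p < k; p++) {
                    sum += a[i][p] * b[p][j];
                }
                c[i][j] = sum; // each task writes only to its own row
            }
        });
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        // Both variants accumulate each element in the same order, so the results match exactly.
        System.out.println(Arrays.deepToString(multiplySequential(a, b)));
        System.out.println(Arrays.deepToString(multiplyParallel(a, b)));
    }
}
```

On a real GPU the same decomposition would typically be finer grained (one thread per output element or per tile), and the cost of copying the matrices to and from device memory has to be weighed against the compute time saved.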
