Paper information - Toolkit-Based High-Performance Data Mining of Large Data on MapReduce Clusters

The enormous growth of data in a variety of applications has increased the need for high-performance data mining in distributed environments. However, standard data mining toolkits do not by themselves support the use of computing clusters. The success of MapReduce for analyzing large data has raised a general interest in applying this model to other data-intensive applications. Unfortunately, current research has not led to an integration of GUI-based data mining toolkits with MapReduce systems built on distributed file systems. This paper defines novel principles for the modeling and design of the user interface, the storage model, and the computational model necessary for the integration of such systems. Additionally, it introduces a novel system architecture for interactive GUI-based data mining of large data on MapReduce clusters that overcomes the limitations of data mining toolkits. As an empirical demonstration, we show an implementation based on Weka and Hadoop.

[1] Domenico Talia, et al. Weka4WS: A WSRF-Enabled Weka Toolkit for Distributed Data Mining on Grids, 2005, PKDD.

[2] Ian H. Witten, et al. Data mining: practical machine learning tools and techniques, 3rd Edition, 1999.

[3] Ingo Mierswa, et al. YALE: rapid prototyping for complex data mining tasks, 2006, KDD '06.

[4] Arlo Faria, et al. MapReduce: Distributed Computing for Machine Learning, 2006.

[5] María S. Pérez-Hernández, et al. Adapting the Weka Data Mining Toolkit to a Grid Based Environment, 2005, AWIC.

[6] Sanjay Ghemawat, et al. The Google file system, 2003.

[7] Kunle Olukotun, et al. Map-Reduce for Machine Learning on Multicore, 2006, NIPS.

[8] Sanjay Ghemawat, et al. MapReduce: Simplified Data Processing on Large Clusters, 2004, OSDI.

[9] Stefan Rüping, et al. GridR: An R-Based Grid-Enabled Tool for Data Analysis in ACGT Clinico-Genomics Trials, 2007, Third IEEE International Conference on e-Science and Grid Computing (e-Science 2007).