A Parallel Distributed Weka Framework for Big Data Mining Using Spark

Effective Big Data Mining requires scalable and efficient solutions that are also accessible to users of all levels of expertise. Despite this, many current efforts to provide effective knowledge extraction via large-scale Big Data Mining tools focus more on performance than on use and tuning which are complex problems even for experts. Weka is a popular and comprehensive Data Mining workbench with a well-known and intuitive interface, nonetheless it supports only sequential single-node execution. Hence, the size of the datasets and processing tasks that Weka can handle within its existing environment is limited both by the amount of memory in a single node and by sequential execution. This work discusses DistributedWekaSpark, a distributed framework for Weka which maintains its existing user interface. The framework is implemented on top of Spark, a Hadoop-related distributed framework with fast in-memory processing capabilities and support for iterative computations. By combining Weka's usability and Spark's processing power, DistributedWekaSpark provides a usable prototype distributed Big Data Mining workbench that achieves near-linear scaling in executing various real-world scale workloads - 91.4% weak scaling efficiency on average and up to 4x faster on average than Hadoop.

[1]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[2]  Tim Kraska,et al.  MLI: An API for Distributed Machine Learning , 2013, 2013 IEEE 13th International Conference on Data Mining.

[3]  María S. Pérez-Hernández,et al.  Adapting the Weka Data Mining Toolkit to a Grid Based Environment , 2005, AWIC.

[4]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[5]  Padhraic Smyth,et al.  From Data Mining to Knowledge Discovery: An Overview , 1996, Advances in Knowledge Discovery and Data Mining.

[6]  Stefan Wrobel,et al.  Toolkit-Based High-Performance Data Mining of Large Data on MapReduce Clusters , 2009, 2009 IEEE International Conference on Data Mining Workshops.

[7]  Domenico Talia,et al.  Weka4WS: A WSRF-Enabled Weka Toolkit for Distributed Data Mining on Grids , 2005, PKDD.

[8]  Shirish Tatikonda,et al.  SystemML: Declarative machine learning on MapReduce , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[9]  Ralf Klinkenberg,et al.  Data Classification: Algorithms and Applications , 2014 .

[10]  Robert A. Muenchen,et al.  The Popularity of Data Analysis Software , 2013 .

[11]  Zoltán Prekopcsák,et al.  Radoop: Analyzing Big Data with RapidMiner and Hadoop , 2011 .

[12]  Scott Shenker,et al.  Disk-Locality in Datacenter Computing Considered Irrelevant , 2011, HotOS.

[13]  Peter J. Haas,et al.  Ricardo: integrating R and Hadoop , 2010, SIGMOD Conference.

[14]  Antony I. T. Rowstron,et al.  Scale-up vs scale-out for Hadoop: time to rethink? , 2013, SoCC.

[15]  Sherif Sakr,et al.  The family of mapreduce and large-scale data processing systems , 2013, CSUR.

[16]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[17]  Rakesh Agrawal,et al.  Parallel Mining of Association Rules , 1996, IEEE Trans. Knowl. Data Eng..

[18]  Ian Witten,et al.  Data Mining , 2000 .

[19]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[20]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[21]  Samuel P. Midkiff,et al.  RABID: A Distributed Parallel R for Large Datasets , 2014, 2014 IEEE International Congress on Big Data.