NIMBLE: A Toolkit for the Implementation of Parallel Data Mining and Machine Learning Algorithms on MapReduce

In the last decade, advances in data collection and storage technologies have led to increased interest in designing and implementing large-scale parallel algorithms for machine learning and data mining (ML-DM). Programming paradigms for expressing large-scale parallelism, such as MapReduce (MR) and the Message Passing Interface (MPI), have been the de facto choices for implementing these ML-DM algorithms. The MR paradigm has been of particular interest, as it gracefully handles large datasets and has built-in resilience against failures. However, these paradigms are low-level and therefore ill-suited for expressing ML-DM algorithms directly. To address this deficiency, we present NIMBLE, a portable infrastructure designed specifically to enable rapid implementation of parallel ML-DM algorithms. The infrastructure allows one to compose parallel ML-DM algorithms from reusable (serial and parallel) building blocks that can be efficiently executed using MR and other parallel programming models; it currently runs on top of Hadoop, an open-source MR implementation. We show how NIMBLE can be used to realize scalable implementations of ML-DM algorithms and present a performance evaluation.
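To make the building-block idea concrete, below is a minimal, self-contained Java sketch of how an ML-DM computation might be composed from a reusable map/reduce building block. Everything here is an illustrative assumption: the MapReduceBlock class, its run method, and the use of a local parallel stream in place of a real Hadoop job are hypothetical and are not NIMBLE's actual API.

```java
// Hypothetical sketch only: a reusable (serial + parallel) building block
// in the spirit of NIMBLE's composition model. Class and method names are
// illustrative assumptions, not NIMBLE's actual API; a local parallel
// stream stands in for execution on Hadoop MapReduce.
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;
import java.util.stream.Collectors;

/** A generic building block: map each record to (key, value), reduce per key. */
final class MapReduceBlock<IN, K, V> {
    private final Function<IN, Map.Entry<K, V>> mapper; // serial per-record logic
    private final BinaryOperator<V> reducer;            // associative per-key combine

    MapReduceBlock(Function<IN, Map.Entry<K, V>> mapper, BinaryOperator<V> reducer) {
        this.mapper = mapper;
        this.reducer = reducer;
    }

    /** Executes the block locally; a real runtime would ship it to Hadoop instead. */
    Map<K, V> run(List<IN> data) {
        return data.parallelStream()
                   .map(mapper)
                   .collect(Collectors.toMap(Map.Entry::getKey,
                                             Map.Entry::getValue,
                                             reducer));
    }
}

public class BuildingBlockDemo {
    public static void main(String[] args) {
        // Compose a word-count step from the generic block; the same block
        // could be reused for, e.g., per-itemset counting in frequent-pattern
        // mining or per-feature statistics in decision-tree learning.
        MapReduceBlock<String, String, Long> count =
                new MapReduceBlock<>(w -> Map.entry(w, 1L), Long::sum);

        List<String> tokens = List.of("map", "reduce", "map", "ml", "map");
        System.out.println(count.run(tokens)); // prints {reduce=1, ml=1, map=3} (order may vary)
    }
}
```

In NIMBLE's actual design, such composed blocks would be handed to the runtime for execution as Hadoop MapReduce jobs; the local parallel stream above merely stands in for that execution layer.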
