Parallel Prediction Algorithms for Heterogeneous Data: A Case Study with Real-Time Big Datasets

Parallel data mining algorithms are extensively used to mine and discover hidden knowledge from varied, unrelated data. They offer shorter training times, shorter execution times, and lower memory requirements. Executing them in a distributed environment, however, raises several issues: the data must be partitioned among processors so that data dependency, communication overhead, and disk I/O cost are kept minimal, while synchronization is maintained and the workload is balanced across nodes. Some of these issues can be resolved by running parallel data mining algorithms on the Apache Hadoop MapReduce framework, which provides improved performance together with reduced communication cost, execution time, training time, and I/O access. This paper proposes a novel framework that aims to extend these advantages in terms of scalability by increasing the number of nodes in the Hadoop cluster, and analyzes the performance of classification algorithms such as K-Nearest Neighbor, Naive Bayes, and Decision Tree. The parallel framework could be extended to other fields of biotechnology where prediction on large datasets is essential.
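As an illustration of the partition-then-combine pattern described above, the following is a minimal single-process sketch of K-Nearest Neighbor classification written in a map/reduce style. It is not the framework proposed in the paper and does not use Hadoop. The function names, the round-robin partitioning, the squared Euclidean distance, and the toy data are all assumptions made for this example.

```python
import heapq
from collections import Counter


def partition(records, n_parts):
    """Round-robin split so each simulated node gets a similar share (assumed scheme)."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_local_knn(part, query, k):
    """Map step: one node returns its k closest (distance, label) pairs."""
    scored = [
        (sum((a - b) ** 2 for a, b in zip(features, query)), label)
        for features, label in part
    ]
    return heapq.nsmallest(k, scored)


def reduce_global_knn(local_results, k):
    """Reduce step: merge local candidates, keep the global k, take a majority vote."""
    merged = heapq.nsmallest(
        k, (pair for result in local_results for pair in result)
    )
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0]


def knn_predict(records, query, k=3, n_parts=2):
    """Run the map step on every partition, then reduce to a single label."""
    parts = partition(records, n_parts)
    local_results = [map_local_knn(p, query, k) for p in parts]
    return reduce_global_knn(local_results, k)


# Hypothetical toy training set: (feature vector, class label).
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

if __name__ == "__main__":
    print(knn_predict(train, (1.1, 1.0), k=3, n_parts=2))
```

Because every partition forwards its own k best candidates, the merged list always contains the true global k nearest neighbors, so the predicted label should not depend on how many partitions are used. Each mapper sends only k pairs to the reducer, which is one way the communication cost mentioned above can be kept small.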
