Distributed decision tree v.2.0

Decision Tree is a widely used classification and prediction algorithm in machine learning that constructs a tree-structured model over the attributes of a dataset. Its distributed counterpart, the Distributed Decision Tree, trains one tree per partition of the input dataset and then combines the individual outputs, collecting votes for classification or averaging for prediction. In this design, the degree of parallelism depends entirely on the number of partitions, so parallelism can be tuned by adjusting the partition count. This setup, however, compromises accuracy: increasing the number of partitions shrinks each partition, and there is an inherent trade-off between accuracy and partition size. In this paper, we therefore propose an improved Distributed Decision Tree algorithm that achieves true parallelism without loss of accuracy. The improved algorithm is implemented on the open-source distributed frameworks Hadoop and Spark, and we benchmark it on medium to large datasets, measuring learning time, tree size, and accuracy.
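To make the partition-per-tree scheme concrete, here is a minimal sketch, not the paper's implementation, assuming a PySpark environment with scikit-learn available on the workers; the helpers train_partition_tree and majority_vote are illustrative names introduced here. Each Spark partition trains one local tree, and classification collects one vote from every tree.

```python
# Minimal sketch of the partition-per-tree scheme (illustrative, not the
# authors' code). Assumes PySpark plus scikit-learn on the workers.
from pyspark.sql import SparkSession
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_partition_tree(rows):
    """Train one decision tree on the records of a single partition."""
    data = list(rows)
    if not data:
        return iter([])
    X = np.array([r[:-1] for r in data])  # feature columns
    y = np.array([r[-1] for r in data])   # class label in the last column
    tree = DecisionTreeClassifier()
    tree.fit(X, y)
    return iter([tree])

def majority_vote(trees, x):
    """Classify x by collecting one vote from each partition's tree."""
    votes = [t.predict([x])[0] for t in trees]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]

spark = SparkSession.builder.appName("distributed-decision-tree").getOrCreate()

# Toy dataset of (feature1, feature2, label) tuples split into 4 partitions.
records = [(float(i % 7), float(i % 3), i % 2) for i in range(1000)]
rdd = spark.sparkContext.parallelize(records, numSlices=4)

# One tree per partition, then a majority vote over the collected trees.
trees = rdd.mapPartitions(train_partition_tree).collect()
print(majority_vote(trees, [3.0, 1.0]))
spark.stop()
```

Raising numSlices increases parallelism but shrinks each training partition, which is exactly the accuracy trade-off the improved algorithm is meant to avoid.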
