Decision Tree Induction Methods for Distributed Environment

As the amount of information grows rapidly, there is strong interest in efficient distributed computing systems, including Grids, public-resource computing systems, P2P systems, and cloud computing. In this paper we take a detailed look at the problem of modeling and optimizing network computing systems for parallel decision tree induction methods. First, we present a comprehensive discussion of these induction methods, with a special focus on their parallel versions. Next, we propose a generic optimization model of a network computing system that can be used for a distributed implementation of parallel decision trees. To illustrate our work, we provide results of numerical experiments showing that the distributed approach significantly improves system throughput.
