Parallel formulations of decision-tree classification algorithms

Classification decision tree algorithms are used extensively for data mining in many domains, such as retail target marketing and fraud detection. Highly parallel algorithms for constructing classification decision trees are desirable for handling large data sets in a reasonable amount of time. Algorithms for building classification decision trees have a natural concurrency, but they are difficult to parallelize owing to the inherently dynamic nature of the computation. We present parallel formulations of classification decision tree learning algorithms based on induction. We describe two basic parallel formulations: one based on the Synchronous Tree Construction Approach and the other on the Partitioned Tree Construction Approach. We discuss the advantages and disadvantages of these methods and propose a hybrid method that combines their strengths. Experimental results on an IBM SP-2 demonstrate excellent speedups and scalability.
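
To make the structure of the computation concrete, the sketch below is a minimal, single-process Python illustration of the synchronous tree construction idea: the training records are partitioned horizontally across processors, every processor evaluates the same candidate splits on its local slice, and the per-attribute class counts are summed globally (the role an all-reduce would play on the SP-2) before the best split is chosen. All function and variable names are illustrative assumptions, not the paper's actual implementation.

```python
# Single-process sketch of the synchronous tree construction approach:
# each "processor" holds a horizontal slice of the records, all of them
# expand the SAME tree node, and per-attribute class counts are merged
# globally (a plain Python merge standing in for an MPI all-reduce)
# before the best split is selected.  Illustrative assumption, not the
# paper's implementation.

from collections import Counter, defaultdict

def local_split_counts(partition, attribute):
    """Class counts per attribute value for one processor's local records."""
    counts = defaultdict(Counter)
    for record, label in partition:
        counts[record[attribute]][label] += 1
    return counts

def merge_counts(all_local_counts):
    """Stand-in for the global reduction of class counts across processors."""
    merged = defaultdict(Counter)
    for local in all_local_counts:
        for value, cls_counts in local.items():
            merged[value].update(cls_counts)
    return merged

def gini_of_split(merged):
    """Weighted Gini index of a candidate split, from the global counts."""
    total = sum(sum(c.values()) for c in merged.values())
    gini = 0.0
    for cls_counts in merged.values():
        n = sum(cls_counts.values())
        impurity = 1.0 - sum((k / n) ** 2 for k in cls_counts.values())
        gini += (n / total) * impurity
    return gini

def best_split(partitions, attributes):
    """One synchronous expansion step: every partition evaluates every
    candidate attribute locally, counts are reduced globally, and the
    attribute with the lowest Gini index is chosen."""
    best_attr, best_score = None, float("inf")
    for attr in attributes:
        local = [local_split_counts(p, attr) for p in partitions]  # done in parallel in reality
        score = gini_of_split(merge_counts(local))
        if score < best_score:
            best_attr, best_score = attr, score
    return best_attr, best_score

if __name__ == "__main__":
    # Toy data: records are dicts, labels are 0/1, split across two "processors".
    p0 = [({"age": "young", "income": "low"}, 0), ({"age": "old", "income": "high"}, 1)]
    p1 = [({"age": "young", "income": "high"}, 1), ({"age": "old", "income": "low"}, 0)]
    print(best_split([p0, p1], ["age", "income"]))  # -> ('income', 0.0)
```

In the partitioned approach, by contrast, whole subtrees rather than record slices would be assigned to different processors once the tree becomes deep enough, which is the trade-off the hybrid method exploits.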
