Parallel Classification for Data Mining on Shared-Memory Multiprocessors

We present parallel algorithms for building decision-tree classifiers on shared-memory multiprocessor (SMP) systems. The proposed algorithms span the gamut of data and task parallelism. The data parallelism is based on attribute scheduling among processors. This basic scheme is extended with task pipelining and dynamic load balancing to yield faster implementations. The task parallel approach uses dynamic subtree partitioning among processors. We evaluate the performance of these algorithms on two machine configurations: one in which the data is too large to fit in memory and must be paged from a local disk as needed, and the other in which memory is sufficiently large to cache the whole data set. This performance evaluation shows that the construction of a decision-tree classifier can be effectively parallelized on an SMP machine with good speedup. For the local disk configuration, the speedup ranged from 2.97 to 3.86 for the build phase and from 2.20 to 3.67 for the total time on a 4-processor SMP. For the large memory configuration, the speedup ranged from 5.36 to 6.67 for the build phase and from 3.07 to 5.98 for the total time on an 8-processor SMP.
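The attribute-scheduling idea can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes numeric attributes and the Gini index as the split criterion (the abstract specifies neither), and the names `gini`, `best_split_for_attribute`, and `best_split` are hypothetical. Each attribute is an independent unit of work, and idle workers pull the next attribute from a shared queue, which gives a simple form of dynamic load balancing.

```python
from concurrent.futures import ThreadPoolExecutor


def gini(counts):
    """Gini impurity of a {class_label: count} dictionary."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_split_for_attribute(rows, labels, attr):
    """Scan one numeric attribute; return (weighted_gini, attr, threshold)."""
    n = len(rows)
    order = sorted(range(n), key=lambda i: rows[i][attr])
    left = {}
    right = {}
    for y in labels:
        right[y] = right.get(y, 0) + 1
    best = (float("inf"), attr, None)
    for pos in range(n - 1):
        i = order[pos]
        # Move one record from the right partition to the left partition.
        left[labels[i]] = left.get(labels[i], 0) + 1
        right[labels[i]] -= 1
        lo, hi = rows[i][attr], rows[order[pos + 1]][attr]
        if lo == hi:
            continue  # no threshold separates equal values
        n_left = pos + 1
        score = (n_left * gini(left) + (n - n_left) * gini(right)) / n
        if score < best[0]:
            best = (score, attr, (lo + hi) / 2)
    return best


def best_split(rows, labels, n_workers=4):
    """Evaluate every attribute in parallel and keep the best split.

    The executor hands attributes to whichever worker is free, so the
    assignment of attributes to workers is dynamic rather than fixed.
    """
    n_attrs = len(rows[0])
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(
            lambda a: best_split_for_attribute(rows, labels, a),
            range(n_attrs),
        )
        return min(results, key=lambda r: r[0])


if __name__ == "__main__":
    rows = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
    labels = ["a", "a", "b", "b"]
    print(best_split(rows, labels))
```

Threads are used here only to show the scheduling structure; in CPython the global interpreter lock prevents pure-Python threads from delivering the kind of speedup reported above, so a real implementation would rely on native threads or processes.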
