A Streaming Parallel Decision Tree Algorithm

We propose a new algorithm for building decision tree classifiers. The algorithm runs in a distributed environment and is designed specifically for classifying large data sets and streaming data. It is shown empirically to be as accurate as a standard decision tree classifier, while scaling to streaming data processed on multiple processors. These findings are supported by a rigorous analysis of the algorithm's accuracy. The essence of the algorithm is to quickly construct histograms at the processors, which compress the data into a fixed amount of memory. A master processor uses these histograms to find near-optimal split points for terminal tree nodes. Our analysis shows that guarantees on the local accuracy of split points imply guarantees on the overall tree accuracy.
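The abstract does not spell out how the histograms are built or combined, so the following is only a minimal sketch under stated assumptions. It assumes each histogram is a bounded list of (centroid, count) bins, that a new point is inserted as its own bin, and that the two closest bins are merged whenever the bound is exceeded. The names `StreamingHistogram`, `update`, `merge`, and `candidate_splits` are hypothetical, and taking midpoints between adjacent centroids as split candidates is an illustrative simplification, not necessarily the paper's procedure.

```python
from bisect import insort


class StreamingHistogram:
    """Hypothetical fixed-memory histogram: at most max_bins [centroid, count] pairs."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # kept sorted by centroid

    def _shrink(self):
        # Assumption: merge the two closest adjacent bins until the bound holds.
        while len(self.bins) > self.max_bins:
            i = min(range(len(self.bins) - 1),
                    key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
            (p1, m1), (p2, m2) = self.bins[i], self.bins[i + 1]
            m = m1 + m2
            # The merged bin sits at the count-weighted mean of the two centroids.
            self.bins[i:i + 2] = [[(p1 * m1 + p2 * m2) / m, m]]

    def update(self, x):
        """Worker side: absorb one streamed value without storing the raw data."""
        insort(self.bins, [float(x), 1])
        self._shrink()

    def merge(self, other):
        """Master side: combine two worker histograms into one bounded histogram."""
        merged = StreamingHistogram(self.max_bins)
        merged.bins = sorted([list(b) for b in self.bins + other.bins])
        merged._shrink()
        return merged

    def total(self):
        return sum(m for _, m in self.bins)


def candidate_splits(hist):
    """Illustrative only: midpoints between adjacent bin centroids."""
    centroids = [p for p, _ in hist.bins]
    return [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
```

In this sketch each worker would call `update` on its share of the stream, and the master would call `merge` on the resulting histograms before evaluating `candidate_splits`. Memory per histogram stays bounded by `max_bins` regardless of how many values are seen.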
