BOAI: Fast Alternating Decision Tree Induction Based on Bottom-Up Evaluation

The Alternating Decision Tree (ADTree) is a successful boosting-based classification model with a wide range of applications. Existing ADTree induction algorithms apply a "top-down" strategy to evaluate the best split at each boosting iteration, which is very time-consuming and therefore unsuitable for modeling large data sets. This paper proposes BOAI, a fast ADTree induction algorithm based on "bottom-up" evaluation, which offers high performance on massive data without sacrificing classification accuracy. BOAI pre-sorts the data once and dynamically evaluates splits bottom-up over the VW-group structure, eliminating much of the redundant sorting and computation in the tree induction procedure. Experimental results on both real and synthetic data sets show that BOAI outperforms the best existing ADTree induction algorithm by a significant margin. In a real case study, BOAI also outperforms TreeNet and Random Forests, two models widely regarded as efficient classifiers.
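The abstract only sketches the core ideas, so the following Python sketch illustrates one plausible reading of the bottom-up evaluation, assuming a VW-group aggregates instance weights per distinct attribute value and per class (the paper's actual data structure may differ). The function names `build_vw_groups` and `best_threshold_split` are illustrative, not from the paper; the Z-value is the ADTree split criterion of Freund and Mason, shown here with the precondition fixed to true so its W(not c1) term vanishes.

```python
import math
from collections import defaultdict

def build_vw_groups(values, labels, weights):
    """Aggregate, per distinct attribute value, the summed instance
    weights of each class: a VW-group-like structure holding one
    (w_pos, w_neg) pair per value instead of one row per record.
    (Hypothetical reconstruction, not the paper's exact structure.)"""
    groups = defaultdict(lambda: [0.0, 0.0])
    for v, y, w in zip(values, labels, weights):
        groups[v][0 if y > 0 else 1] += w
    # Sort distinct values once; later boosting iterations can
    # reuse this order instead of re-sorting the raw records.
    return sorted(groups.items())

def best_threshold_split(vw_groups):
    """Scan the sorted VW-groups once, maintaining prefix weight
    sums, and return the threshold t minimizing the ADTree Z-value
    Z = 2*(sqrt(W+(x<=t)*W-(x<=t)) + sqrt(W+(x>t)*W-(x>t)))."""
    tot_pos = sum(g[1][0] for g in vw_groups)
    tot_neg = sum(g[1][1] for g in vw_groups)
    left_pos = left_neg = 0.0
    best_z, best_t = float("inf"), None
    for value, (w_pos, w_neg) in vw_groups[:-1]:
        left_pos += w_pos
        left_neg += w_neg
        z = 2.0 * (math.sqrt(left_pos * left_neg)
                   + math.sqrt((tot_pos - left_pos) * (tot_neg - left_neg)))
        if z < best_z:
            best_z, best_t = z, value
    return best_t, best_z

# Toy usage: one numeric attribute, labels in {+1, -1}, uniform weights.
xs = [2.0, 3.5, 3.5, 5.0, 7.5, 9.0]
ys = [+1, +1, -1, -1, +1, -1]
ws = [1.0] * len(xs)
print(best_threshold_split(build_vw_groups(xs, ys, ws)))
```

Under this reading, the speedup comes from the per-value aggregates being far fewer than the raw records and the sort order being computed once, so each boosting iteration evaluates all candidate thresholds in a single linear pass over the groups.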
