Incrementally Optimized Decision Tree for Mining Imperfect Data Streams

The Very Fast Decision Tree (VFDT) is one of the most important classification algorithms for real-time data stream mining. However, imperfections in data streams, such as noise and imbalanced class distribution, do exist in real world applications and they jeopardize the performance of VFDT. Traditional sampling techniques and post-pruning may be impractical for a non-stopping data stream. To deal with the adverse effects of imperfect data streams, we have invented an incremental optimization model that can be integrated into the decision tree model for data stream classification. It is called the Incrementally Optimized Very Fast Decision Tree (I-OVFDT) and it balances performance (in relation to prediction accuracy, tree size and learning time) and diminishes error and tree size dynamically. Furthermore, two new Functional Tree Leaf strategies are extended for I-OVFDT that result in superior performance compared to VFDT and its variant algorithms. Our new model works especially well for imperfect data streams. I-OVFDT is an anytime algorithm that can be integrated into those existing VFDT-extended algorithms based on Hoeffding bound in node splitting. The experimental results show that I-OVFDT has higher accuracy and more compact tree size than other existing data stream classification methods.

[1]  Geoff Holmes,et al.  MOA: Massive Online Analysis , 2010, J. Mach. Learn. Res..

[2]  Stuart J. Russell,et al.  Online bagging and boosting , 2005, 2005 IEEE International Conference on Systems, Man and Cybernetics.

[3]  Richard Brendon Kirkby,et al.  Improving Hoeffding Trees , 2007 .

[4]  João Gama,et al.  Accurate decision trees for mining high-speed data streams , 2003, KDD '03.

[5]  João Gama,et al.  Learning decision trees from dynamic data streams , 2005, SAC '05.

[6]  Geoff Hulten,et al.  Mining high-speed data streams , 2000, KDD '00.

[7]  Céline Rouveirol,et al.  Machine Learning: ECML-98 , 1998, Lecture Notes in Computer Science.

[8]  Carla E. Brodley,et al.  Pruning Decision Trees with Misclassification Costs , 1998, ECML.

[9]  Simon Fong,et al.  Moderated VFDT in Stream Mining Using Adaptive Tie Threshold and Incremental Pruning , 2011, DaWaK.

[10]  H. Chernoff A Measure of Asymptotic Efficiency for Tests of a Hypothesis Based on the sum of Observations , 1952 .

[11]  Xue Li,et al.  OcVFDT: one-class very fast decision tree for one-class classification of data streams , 2009, SensorKDD '09.

[12]  Geoff Hulten,et al.  Mining time-changing data streams , 2001, KDD '01.

[13]  Sattar Hashemi,et al.  Flexible decision tree for data stream classification in the presence of concept change, noise and missing values , 2009, Data Mining and Knowledge Discovery.

[14]  Geoff Holmes,et al.  New Options for Hoeffding Trees , 2007, Australian Conference on Artificial Intelligence.

[15]  João Gama,et al.  Learning decision trees from dynamic data streams , 2005, SAC '05.

[16]  João Gama,et al.  Issues in evaluation of stream learning algorithms , 2009, KDD.