Concept Drift in Decision Trees Learning from Data Streams

ABSTRACT: This paper presents the Ultra Fast Forest of Trees (UFFT) system, an incremental algorithm that works online, processing each example in constant time and performing a single scan over the training examples. The system is designed for numerical data. It uses analytical techniques to choose the splitting criteria and the information gain to estimate the merit of each possible splitting test. For multi-class problems the algorithm builds a binary tree for each possible pair of classes, leading to a forest of trees. To detect concept drift, we maintain a naive-Bayes classifier at each inner node. Statistical theory states that while the distribution of the examples is stationary, the online error of naive Bayes will decrease; when it does not, the test installed at this node is no longer appropriate for the current distribution of the examples, and the entire subtree rooted at this node is pruned. The naive-Bayes classifiers used at leaves to classify test examples, and at decision nodes to detect changes in the distribution of the examples, are obtained directly from the sufficient statistics required to compute the splitting criteria, without any additional computation. This is a key advantage in the context of high-speed data streams. The experimental results show good performance both in detecting the change of concept and in learning the new concept. KEYWORDS: Concept Drift, Forest of Trees, Data Streams.
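The two structural ideas in the abstract — one binary tree per pair of classes, and pruning a subtree when the online error of the naive-Bayes classifier at its root stops decreasing — can be illustrated with a minimal sketch. The `DriftMonitor` below is a hypothetical error-rate test in the spirit of the paper's statistical argument (the specific minimum-tracking rule and the 3-standard-deviation threshold are assumptions, not the paper's exact procedure), and `class_pairs` merely enumerates the pairwise decomposition:

```python
import itertools
import math


class DriftMonitor:
    """Tracks the online error rate of a node's classifier.

    Hypothetical sketch: while the distribution is stationary the error
    rate should decrease.  We record the minimum observed error (plus its
    standard deviation) and flag drift when the current error rises
    `threshold` standard deviations above that minimum.
    """

    MIN_EXAMPLES = 30  # don't test before the estimate stabilises

    def __init__(self, threshold=3.0):
        self.n = 0                    # examples seen at this node
        self.errors = 0               # misclassifications so far
        self.p_min = float("inf")     # lowest error rate observed
        self.s_min = float("inf")     # its standard deviation
        self.threshold = threshold

    def update(self, correct):
        """Record one prediction outcome; return True if drift is signalled."""
        self.n += 1
        if not correct:
            self.errors += 1
        if self.n < self.MIN_EXAMPLES:
            return False
        p = self.errors / self.n                 # current error rate
        s = math.sqrt(p * (1.0 - p) / self.n)    # binomial std. deviation
        if p + s < self.p_min + self.s_min:      # error still decreasing
            self.p_min, self.s_min = p, s
        # drift: error rose `threshold` std-devs above its minimum
        return p + s > self.p_min + self.threshold * self.s_min


def class_pairs(classes):
    """One binary tree per pair of classes -> a forest of k*(k-1)/2 trees."""
    return list(itertools.combinations(sorted(classes), 2))
```

For example, `class_pairs(["b", "a", "c"])` yields the three pairs `("a", "b")`, `("a", "c")`, `("b", "c")`, each of which would get its own binary tree; feeding `DriftMonitor.update` a stream whose error rate jumps from roughly 10% to 100% makes it signal drift shortly after the jump, at which point the subtree rooted at that node would be pruned.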
