A New Decision Tree Classification Method for Mining High-Speed Data Streams Based on Threaded Binary Search Trees

One of the most important algorithms for mining data streams is VFDT. It uses the Hoeffding inequality to achieve a probabilistic bound on the accuracy of the constructed tree. Gama et al. have extended VFDT in two directions: their system, VFDTc, can deal with continuous data and can use more powerful classification techniques at the tree leaves. In this paper, we revisit this problem and implement a system, VFDTt, on top of VFDT and VFDTc. We make the following three contributions. 1) We present a threaded binary search tree (TBST) approach for efficiently handling continuous attributes. It builds a threaded binary search tree whose processing time for inserting values is O(n log n), while VFDT's processing time is O(n^2). When a new example arrives, VFDTc needs to update O(log n) attribute tree nodes, whereas VFDTt needs to update only the one necessary node. 2) We improve the method for finding the best split-test point of a given continuous attribute. Compared with the method used in VFDTc, processing time improves from O(n log n) to O(n). 3) Compared with VFDTc, the number of candidate split-test points in VFDTt decreases from O(n) to O(log n). Compared with VFDT, the most relevant property of our system is an average reduction of 25.53% in processing time, while keeping the same tree size and accuracy. Overall, the techniques introduced here significantly improve the efficiency of decision tree classification on data streams.
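
As an illustration of the ideas named above, the following Python sketch shows a right-threaded binary search tree that stores per-value class counts, a single in-order pass over the threads to pick a split point by information gain, and the standard Hoeffding bound. It is a minimal sketch under stated assumptions, not the VFDTt implementation: the names `ThreadedBST`, `best_split`, and `hoeffding_bound` are hypothetical, the tree is unbalanced (so insertion is logarithmic only on average), every distinct value is scanned as a candidate, and the reduction of candidates to O(log n) described in the abstract is not reproduced.

```python
import math
from collections import Counter


def hoeffding_bound(value_range, delta, n):
    """Standard Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def entropy(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c > 0)


class Node:
    def __init__(self, key):
        self.key = key
        self.counts = Counter()  # class label -> number of examples with this value
        self.left = None
        self.right = None        # right child, or in-order successor if is_thread
        self.is_thread = True    # True while `right` is a thread (or None)


class ThreadedBST:
    """Right-threaded BST over the observed values of one continuous attribute."""

    def __init__(self):
        self.root = None
        self.total = Counter()   # class counts over all inserted examples

    def insert(self, key, label):
        self.total[label] += 1
        if self.root is None:
            self.root = Node(key)
            self.root.counts[label] += 1
            return
        cur = self.root
        while True:
            if key == cur.key:               # existing value: update this one node
                cur.counts[label] += 1
                return
            if key < cur.key:
                if cur.left is None:
                    new = Node(key)
                    new.right = cur          # thread to in-order successor
                    cur.left = new
                    new.counts[label] += 1
                    return
                cur = cur.left
            else:
                if cur.is_thread:
                    new = Node(key)
                    new.right = cur.right    # inherit the old thread
                    cur.right = new
                    cur.is_thread = False
                    new.counts[label] += 1
                    return
                cur = cur.right

    def in_order(self):
        """Yield nodes in ascending key order by following threads (no stack)."""
        cur = self.root
        while cur is not None and cur.left is not None:
            cur = cur.left
        while cur is not None:
            yield cur
            if cur.is_thread:
                cur = cur.right
            else:
                cur = cur.right
                while cur.left is not None:
                    cur = cur.left


def best_split(tree):
    """One in-order pass; returns (key, gain) for the best test `value <= key`."""
    n = sum(tree.total.values())
    base = entropy(tree.total)
    left = Counter()
    best_key, best_gain = None, 0.0
    for node in tree.in_order():
        left.update(node.counts)
        n_left = sum(left.values())
        if n_left == n:                      # nothing would remain on the right
            break
        right = tree.total - left
        gain = base - n_left / n * entropy(left) - (n - n_left) / n * entropy(right)
        if gain > best_gain:
            best_key, best_gain = node.key, gain
    return best_key, best_gain
```

In this sketch, each arriving example changes the counts of a single node, and the split search visits each distinct value once. A Hoeffding-tree learner would then compare the gain difference between the two best attributes against `hoeffding_bound(R, delta, n)`, with `R = log2(number of classes)` for information gain, before committing to a split.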

[1]  Gerhard Lakemeyer,et al.  KI 2002: Advances in Artificial Intelligence , 2002, Lecture Notes in Computer Science.

[2]  João Gama,et al.  Accurate decision trees for mining high-speed data streams , 2003, KDD '03.

[3]  Shai Ben-David,et al.  Detecting Change in Data Streams , 2004, VLDB.

[4]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[5]  Michaela M. Black,et al.  Maintaining the performance of a learned classifier under concept drift , 1999, Intell. Data Anal..

[6]  Jennifer Widom,et al.  STREAM: the stanford stream data manager (demonstration description) , 2003, SIGMOD '03.

[7]  Philip S. Yu,et al.  On demand classification of data streams , 2004, KDD.

[8]  Zhoujun Li,et al.  An Efficient Classification System Based on Binary Search Trees for Data Streams Mining , 2007, Second International Conference on Systems (ICONS'07).

[9]  Jennifer Widom,et al.  Models and issues in data stream systems , 2002, PODS.

[10]  Oded Maimon.  Knowledge Discovery and Data Mining : The Info-Fuzzy Network (IFN) Methodology , 2000 .

[11]  Mark Last.  Online classification of nonstationary data streams , 2002 .

[12]  Wei-Pang Yang,et al.  An Efficient and Sensitive Decision Tree Approach to Mining Concept-Drifting Data Streams , 2008, Informatica.

[13]  Philip S. Yu,et al.  Mining concept-drifting data streams using ensemble classifiers , 2003, KDD '03.

[14]  S. Muthukrishnan,et al.  Data streams: algorithms and applications , 2005, SODA '03.

[15]  Usama M. Fayyad,et al.  On the Handling of Continuous-Valued Attributes in Decision Tree Generation , 1992, Machine Learning.

[16]  Cezary Z. Janikow,et al.  Fuzzy decision trees: issues and methods , 1998, IEEE Trans. Syst. Man Cybern. Part B.

[17]  Wei Fan.  StreamMiner: A Classifier Ensemble-based Engine to Mine Concept-drifting Data Streams , 2004, VLDB.

[18]  Andrew W. Moore,et al.  Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation , 1993, NIPS.

[19]  Geoff Hulten,et al.  Mining high-speed data streams , 2000, KDD '00.

[20]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[21]  Leonard Adelman,et al.  Examining the effects of cognitive consistency between training and displays , 1998, IEEE Trans. Syst. Man Cybern. Part A.

[22]  Geoff Hulten,et al.  Mining time-changing data streams , 2001, KDD '01.

[23]  Niall M. Adams,et al.  The impact of changing populations on classifier performance , 1999, KDD '99.

[24]  Paul E. Utgoff,et al.  Incremental Induction of Decision Trees , 1989, Machine Learning.

[25]  Shonali Krishnaswamy,et al.  Mining data streams: a review , 2005, SIGMOD Rec.

[26]  Charu C. Aggarwal,et al.  A framework for diagnosing changes in evolving data streams , 2003, SIGMOD '03.

[27]  W. Hoeffding.  Probability Inequalities for Sums of Bounded Random Variables , 1963 .

[28]  Jorma Rissanen,et al.  SLIQ: A Fast Scalable Classifier for Data Mining , 1996, EDBT.

[29]  Ruoming Jin,et al.  Efficient decision tree construction on streaming data , 2003, KDD '03.

[30]  Steffen Hölldobler,et al.  Incremental Fuzzy Decision Trees , 2002, KI.