Improving Hoeffding Trees

Modern information technology allows data to be collected at a far greater rate than ever before, so fast, in fact, that the main problem is making sense of it all. Machine learning offers the promise of a solution, but the field has mainly focused on achieving high accuracy when the supply of data is limited. While this focus has produced sophisticated classification algorithms, many of them do not cope with ever-increasing data set sizes. When data sets grow to the point where they effectively represent a continuous supply, or data stream, incremental classification algorithms are required. In this setting the effectiveness of an algorithm cannot be assessed by accuracy alone: consideration must also be given to the memory available to the algorithm and to the speed at which data is processed, both the time taken to predict the class of a new sample and the time taken to incorporate that sample into an incrementally updated classification model.

The Hoeffding tree algorithm is a state-of-the-art method for inducing decision trees from data streams, and the aim of this thesis is to improve it. To measure improvement, a comprehensive framework for evaluating the performance of data stream algorithms is developed. Within the framework, memory size is fixed in order to simulate realistic application scenarios, and classes of synthetic data are generated to simulate continuous operation, providing an evaluation on a large scale.

Improvements to many aspects of the Hoeffding tree algorithm are demonstrated. First, a number of methods for handling continuous numeric features are compared. Second, the prediction strategy used in the leaves of the tree is investigated to evaluate the utility of various methods. Finally, the possibility of improving accuracy with ensemble methods is explored. The experimental results provide meaningful comparisons of accuracy and processing speed between different modifications of the Hoeffding tree algorithm under various memory limits. The study of numeric attributes demonstrates that sacrificing accuracy for space at the local level often results in improved global accuracy. The prediction strategy shown to perform best adaptively chooses between standard majority-class and Naive Bayes prediction in the leaves. The investigation of ensemble methods shows that combining trees can be worthwhile, but only when sufficient memory is available, and that improvement is less likely than in traditional machine learning; in particular, issues are encountered when applying the popular boosting method to streams.
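For readers unfamiliar with the bound behind the algorithm's name: a Hoeffding tree decides when to split a leaf using the Hoeffding bound. After observing n examples of a random variable with range R, the bound states that with probability 1 - δ the true mean lies within ε of the observed mean, where

\[
\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}
\]

In the standard algorithm of Domingos and Hulten, a leaf is split on its best attribute once the observed difference in information gain between the best and second-best candidate attributes exceeds ε, so that with confidence 1 - δ the same attribute would have been chosen given unlimited data.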
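The adaptive leaf prediction strategy mentioned above can be illustrated with a minimal sketch in Python. It assumes discrete attribute values and a simple leaf structure; the class name AdaptiveLeaf, its counters, and the Laplace smoothing are illustrative choices for this example rather than the thesis's actual implementation. The idea is that each leaf tracks how often a majority-class prediction and a Naive Bayes prediction would each have been correct on the training examples routed to it, and at prediction time uses whichever predictor has the better record.

import math
from collections import defaultdict

class AdaptiveLeaf:
    """Leaf that adaptively switches between majority-class and Naive Bayes
    prediction (illustrative sketch, discrete attribute values only)."""

    def __init__(self):
        self.class_counts = defaultdict(int)                       # class label -> count
        self.attr_counts = defaultdict(lambda: defaultdict(int))   # (attr index, class) -> {value: count}
        self.mc_correct = 0  # times majority-class prediction would have been right
        self.nb_correct = 0  # times Naive Bayes prediction would have been right

    def _majority_class(self):
        return max(self.class_counts, key=self.class_counts.get, default=None)

    def _naive_bayes(self, x):
        total = sum(self.class_counts.values())
        best_class, best_score = None, float("-inf")
        for c, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for i, v in enumerate(x):
                seen = self.attr_counts.get((i, c), {})
                # Laplace-smoothed log likelihood of attribute value v given class c
                score += math.log((seen.get(v, 0) + 1) / (count + len(seen) + 1))
            if score > best_score:
                best_class, best_score = c, score
        return best_class

    def learn(self, x, y):
        # Before updating the statistics, score both predictors on this example.
        if self.class_counts:
            if self._majority_class() == y:
                self.mc_correct += 1
            if self._naive_bayes(x) == y:
                self.nb_correct += 1
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.attr_counts[(i, y)][v] += 1

    def predict(self, x):
        # Use whichever predictor has been more accurate at this leaf so far.
        if self.nb_correct > self.mc_correct:
            return self._naive_bayes(x)
        return self._majority_class()

One reason this hybrid is attractive under memory limits is that a Hoeffding tree leaf already maintains per-class attribute statistics in order to evaluate candidate splits, so Naive Bayes prediction can reuse those counts at little extra memory cost.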
