Decision Trees for Mining Data Streams Based on the Gaussian Approximation

Since the Hoeffding tree algorithm was proposed, decision trees have become one of the most popular tools for mining data streams. The key step in constructing a decision tree is determining the best attribute on which to split the node under consideration. Several methods for solving this problem have been presented so far. However, they either rest on an incorrect mathematical justification (e.g., the Hoeffding tree algorithm) or are time-consuming (e.g., the McDiarmid tree algorithm). In this paper, we propose a new method that significantly outperforms the McDiarmid tree algorithm and has a solid mathematical basis. Our method ensures, with a high probability set by the user, that the best attribute chosen for the node from a finite data sample is the same attribute that would be chosen from the whole data stream.
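The split decision described above can be illustrated with a minimal sketch. It is not the bound derived in this paper: `hoeffding_epsilon` is the classical Hoeffding threshold used by Hoeffding trees, while `gaussian_epsilon` is a generic normal-quantile threshold that assumes the difference between the two best split-measure estimates is approximately normal with standard deviation `sigma / sqrt(n)`. The function names and the parameter `sigma` are hypothetical and introduced only for illustration.

```python
import math
from statistics import NormalDist


def hoeffding_epsilon(value_range: float, delta: float, n: int) -> float:
    """Classical Hoeffding threshold: sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def gaussian_epsilon(sigma: float, delta: float, n: int) -> float:
    """Normal-quantile threshold (illustrative assumption, not the paper's bound).

    sigma is a hypothetical, user-supplied bound on the standard deviation
    of the per-sample difference in the split measure.
    """
    z = NormalDist().inv_cdf(1.0 - delta)  # one-sided (1 - delta) quantile
    return z * sigma / math.sqrt(n)


def should_split(gain_best: float, gain_second: float, epsilon: float) -> bool:
    """Split only if the best attribute beats the runner-up by more than epsilon."""
    return (gain_best - gain_second) > epsilon


if __name__ == "__main__":
    n, delta = 1000, 0.05
    eps_h = hoeffding_epsilon(value_range=1.0, delta=delta, n=n)
    eps_g = gaussian_epsilon(sigma=1.0, delta=delta, n=n)
    print(f"Hoeffding epsilon: {eps_h:.4f}")
    print(f"Gaussian epsilon:  {eps_g:.4f}")
    print("split?", should_split(0.30, 0.10, eps_g))
```

In both cases the user-chosen `delta` plays the role of the failure probability: the node is split only when the observed gap between the two best attributes exceeds the threshold, so that with probability at least `1 - delta` the same attribute would be chosen from the whole stream, provided the stated assumptions hold.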
