Decision Trees for Mining Data Streams Based on the McDiarmid's Bound

The Hoeffding tree algorithm is the most popular tool for mining data streams. It uses Hoeffding's bound to determine the smallest number of examples needed at a node to select a splitting attribute. In the literature, the same Hoeffding's bound has been used for any evaluation function (heuristic measure), e.g., information gain or the Gini index. In this paper, it is shown that Hoeffding's inequality is not appropriate for solving the underlying problem. We prove two theorems presenting McDiarmid's bound for both the information gain, used in the ID3 algorithm, and the Gini index, used in the Classification and Regression Trees (CART) algorithm. The results of the paper guarantee that a decision tree learning system applied to data streams and based on McDiarmid's bound produces output that is nearly identical to that of a conventional learner. These results have a significant impact on the state of the art in mining data streams, and the various methods and algorithms developed so far should be reconsidered.
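For context, the conventional Hoeffding-tree split test that the paper argues against can be sketched as follows. This is a minimal illustration of the standard Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), and not the McDiarmid-based bounds proved in the paper. The function names and the parameters `value_range` (R), `delta`, and `gap` are assumptions introduced here for illustration.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the mean of n bounded
    observations (spread value_range) is within epsilon of its expectation."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def min_examples(value_range, delta, gap):
    """Smallest n for which hoeffding_epsilon(value_range, delta, n) < gap."""
    threshold = value_range ** 2 * math.log(1.0 / delta) / (2.0 * gap ** 2)
    return math.floor(threshold) + 1


def should_split(best, second_best, value_range, delta, n):
    """Conventional Hoeffding-tree test: split when the observed gap between
    the two best attributes' heuristic values exceeds epsilon."""
    return (best - second_best) > hoeffding_epsilon(value_range, delta, n)


if __name__ == "__main__":
    # Hypothetical numbers: heuristic range 1.0, delta = 1e-7, observed gap 0.1.
    n = min_examples(1.0, 1e-7, 0.1)
    print(n, hoeffding_epsilon(1.0, 1e-7, n))
```

The paper's point is that this test treats the heuristic measure as a sum of bounded random variables, which information gain and the Gini index are not; the sketch only shows what the criticized criterion computes.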
