Data Stream Classification Using Classifier Ensemble

For the contemporary business, the crucial factor is making smart decisions on the basis of the knowledge hidden in stored data. Unfortunately,m traditional simple methods of data analysis are not sufficient for efficient management of modern enterprizes, because they are not appropriate for the huge and growing amount of the data stored by them. Additionally data usually comes continuously in the form of so-called data stream. The great disadvantage of traditional classification methods is that they assume that statistical properties of the discovered concept are being unchanged, while in real situation, we could observe so-called concept drift, which could be caused by changes in the probabilities of classes or/and conditional probability distributions of classes. The potential for considering new training data is an important feature of machine learning methods used in security applications (spam filtering or intrusion detection) or decision support systems for marketing departments, which need to follow the changing client behavior. Unfortunately, the occurrence of concept drift dramatically decreases classification accuracy. This work presents the comprehensive study on the ensemble classifier approach applied to the problem of drifted data streams. Especially it reports the research on modifications of previously developed Weighted Aging Classifier Ensemble (WAE) algorithm, which is able to construct a valuable classifier ensemble for classification of incremental drifted stream data. We generalize WAE method and propose the general framework for this approach. Such framework can prune an classifier ensemble before or after assigning weights to individual classifiers. Additionally, we propose new classifier pruning criteria, weight calculation methods, and aging operators. We also propose rejuvenating operator, which is able to soften the aging effect, which could be useful, especially in the case if quite ”old” classifiers are high quality models, i.e., their presence increases ensemble accuracy, what could be found, e.g., in the case of recurring concept drift. The chosen characteristics of the proposed frameworks were evaluated on the basis of the wide range of computer experiments carried out on the two benchmark data streams. Obtained results confirmed the usability of proposed method to the data stream classification with the presence of incremental concept drift.

[1]  Michal Wozniak,et al.  Application of Combined Classifiers to Data Stream Classification , 2013, CISIM.

[2]  Philip S. Yu,et al.  Mining concept-drifting data streams using ensemble classifiers , 2003, KDD '03.

[3]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization, advances in kernel methods , 1999 .

[4]  Gerhard Widmer,et al.  Learning in the presence of concept drift and hidden contexts , 2004, Machine Learning.

[5]  Ingrid Renz,et al.  Adaptive Information Filtering: Learning in the Presence of Concept Drifts , 1998 .

[6]  Shai Ben-David,et al.  Detecting Change in Data Streams , 2004, VLDB.

[7]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[8]  Manfred K. Warmuth,et al.  The weighted majority algorithm , 1989, 30th Annual Symposium on Foundations of Computer Science.

[9]  William Nick Street,et al.  A streaming ensemble algorithm (SEA) for large-scale classification , 2001, KDD '01.

[10]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[11]  Subhash C. Bagui,et al.  Combining Pattern Classifiers: Methods and Algorithms , 2005, Technometrics.

[12]  S. Cessie,et al.  Ridge Estimators in Logistic Regression , 1992 .

[13]  Michal Wozniak,et al.  Hybrid Classifiers - Methods of Data, Knowledge, and Classifier Combination , 2013, Studies in Computational Intelligence.

[14]  Robert C. Holte,et al.  Very Simple Classification Rules Perform Well on Most Commonly Used Datasets , 1993, Machine Learning.

[15]  Thomas Seidl,et al.  MOA: A Real-Time Analytics Open Source Framework , 2011, ECML/PKDD.

[16]  Marcus A. Maloof,et al.  Dynamic weighted majority: a new ensemble method for tracking concept drift , 2003, Third IEEE International Conference on Data Mining.

[17]  Geoff Hulten,et al.  A General Framework for Mining Massive Data Streams , 2003 .

[19]  Mykola Pechenizkiy,et al.  Dynamic integration of classifiers for handling concept drift , 2008, Inf. Fusion.

[20]  Konrad Jackowski,et al.  Fixed-size ensemble classifier system evolutionarily adapted to a recurring context with an unlimited pool of classifiers , 2013, Pattern Analysis and Applications.