Classifiers for Concept-drifting Data Streams: Evaluating Things That Really Matter

When evaluating the performance of a classifier for concept-drifting data streams, two factors are crucial: prediction accuracy and the ability to adapt. The first factor can be analyzed with a simple error rate, which can be calculated using a holdout test set, chunks of examples, or incrementally after each example [1]. More recently, Gama [2] proposed prequential accuracy as a means of evaluating data stream classifiers and enhancing drift detection methods. For imbalanced data streams, Bifet and Frank [3] proposed using the Kappa statistic over a sliding window to assess a classifier's predictive abilities. However, all of the aforementioned measures, when averaged over an entire stream, lose information about the classifier's reactions to drifts. For example, an algorithm that achieves very high accuracy during periods of concept stability but loses accuracy drastically when drifts occur can still be characterized by higher overall accuracy than an algorithm that is less accurate between drifts but reacts very well to changes. If we want our algorithm to react quickly to, e.g., market changes, we should choose the second algorithm; however, to do so we would have to analyze the entire plot of the classifier's prequential accuracy, which cannot be easily automated and requires user interaction.

To evaluate the second factor, the ability to adapt, separate methods are needed. Some researchers evaluate a classifier's ability to adapt by comparing drift reaction times [4]. It is important to note that, in order to calculate reaction times, a human expert usually needs to determine the moments when drifts start and end. To automate the assessment of adaptability, Shaker and Hullermeier [5] proposed an approach called recovery analysis, which uses synthetic datasets to calculate a classifier's reaction time. A different evaluation method, which also uses artificially generated datasets, was proposed by Zliobaite [6]. The author put forward three controlled permutation techniques that create datasets which help assess a classifier's robustness to variations in how changes occur. However, approaches such as [5], which calculate absolute or relative drift reaction times, require external knowledge about drifts in real streams or the use of synthetic datasets and, therefore, can only be used offline. Furthermore, reaction times are always calculated separately from accuracy, which makes choosing the best classifier more difficult.
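To make the accuracy measures discussed at the beginning of this section concrete, the following sketch shows how prequential (test-then-train) accuracy and a sliding-window Kappa statistic could be computed. It is a minimal illustration, not the implementation used in [2] or [3]; the `stream` iterator and the classifier's `predict`/`partial_fit` interface are assumptions made for the example.

```python
from collections import deque

def prequential_evaluation(stream, classifier, window_size=500):
    """Sketch of prequential (test-then-train) evaluation.

    For every example, the classifier first predicts and then learns from it.
    Accuracy and the Kappa statistic are computed over a sliding window, so the
    resulting series reflects the classifier's current behavior rather than an
    average over the whole stream. `stream` is assumed to yield (x, y) pairs and
    `classifier` to expose predict(x) and partial_fit(x, y) (assumed interface).
    """
    window = deque(maxlen=window_size)   # recent (prediction, true label) pairs
    accuracies, kappas = [], []

    for x, y in stream:
        y_pred = classifier.predict(x)   # test first ...
        classifier.partial_fit(x, y)     # ... then train (prequential protocol)
        window.append((y_pred, y))

        # Sliding-window accuracy: fraction of correct predictions in the window.
        p0 = sum(p == t for p, t in window) / len(window)

        # Chance agreement p_e for Kappa: for each class, the product of the
        # classifier's and the stream's label frequencies within the window.
        labels = {t for _, t in window} | {p for p, _ in window}
        pe = sum(
            (sum(p == c for p, _ in window) / len(window))
            * (sum(t == c for _, t in window) / len(window))
            for c in labels
        )
        kappa = (p0 - pe) / (1 - pe) if pe < 1 else 0.0

        accuracies.append(p0)
        kappas.append(kappa)

    return accuracies, kappas
```

Averaging such windowed values over the entire stream is precisely the step that hides a classifier's reactions to drifts, which is why the adaptability-oriented evaluation methods discussed above are needed as a complement.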