Classification for concept-drifting data streams with limited amount of labeled data

Most existing approaches to classifying concept-drifting data streams assume that the true label of each instance becomes available right after the instance is classified, and they use that label both to detect concept drift and to adjust the current model. This assumption is impractical in real-world applications because manual labeling of data is costly and time-consuming. We propose a novel technique to overcome this problem. The proposed method uses the model clusters generated by the fast KNNModel algorithm to classify the instances in the data stream. Working only with unlabeled test instances, it detects the arrival of a novel class or a drift in the underlying concept of a class when the number of instances not covered by any model cluster increases, relative to the preceding period, by an amount that is statistically significant at a given significance level. Domain experts are asked to label a few instances to adjust the current model if and only if concept drift occurs. Experimental results on both synthetic and real data streams show that, compared with traditional classification algorithms, our method achieves comparable or better efficacy and efficiency while using only a small amount of labeled data.
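The coverage-based drift check described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the cluster representation or the statistical test, so the sketch assumes that each model cluster is a (center, radius, label) tuple, that an instance is covered when it lies within the radius of some cluster under Euclidean distance, and that the increase in the uncovered rate is judged with a one-sided two-proportion z-test. The names `covering_label`, `uncovered_count`, and `drift_detected` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Assumed (hypothetical) representation of a model cluster:
# (center, radius, class label).
Cluster = Tuple[Sequence[float], float, str]


def covering_label(x: Sequence[float], clusters: List[Cluster]) -> Optional[str]:
    """Return the label of the nearest cluster that covers x, or None."""
    best_label, best_dist = None, math.inf
    for center, radius, label in clusters:
        d = math.dist(x, center)  # Euclidean distance (an assumption)
        if d <= radius and d < best_dist:
            best_label, best_dist = label, d
    return best_label


def uncovered_count(window: List[Sequence[float]], clusters: List[Cluster]) -> int:
    """Count the instances in a window that no cluster covers."""
    return sum(1 for x in window if covering_label(x, clusters) is None)


def drift_detected(ref_uncovered: int, ref_size: int,
                   cur_uncovered: int, cur_size: int,
                   z_crit: float = 1.645) -> bool:
    """One-sided two-proportion z-test on the uncovered rate.

    The test is an assumed stand-in for the significance check.
    z_crit = 1.645 corresponds to a one-sided significance level of 0.05.
    """
    p_ref = ref_uncovered / ref_size
    p_cur = cur_uncovered / cur_size
    pooled = (ref_uncovered + cur_uncovered) / (ref_size + cur_size)
    se = math.sqrt(pooled * (1 - pooled) * (1 / ref_size + 1 / cur_size))
    if se == 0.0:
        return False  # identical all-covered or all-uncovered windows
    return (p_cur - p_ref) / se > z_crit


# Two toy clusters and two windows of unlabeled instances.
clusters = [((0.0, 0.0), 1.0, "a"), ((5.0, 5.0), 1.0, "b")]
reference = [(0.1, 0.1)] * 50 + [(5.0, 5.2)] * 48 + [(9.0, 9.0)] * 2
current = [(0.1, 0.0)] * 60 + [(10.0, 10.0)] * 40

if drift_detected(uncovered_count(reference, clusters), len(reference),
                  uncovered_count(current, clusters), len(current)):
    print("drift suspected: request labels from a domain expert")
```

Under these assumptions no true labels are needed for detection: only the coverage status of each unlabeled instance enters the test, and labeling is requested only after the test fires.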