Active Learning by Clustering for Drifted Data Stream Classification

Data stream classifier learning usually assumes that the labels of all incoming examples are available without delay and can be used to update the predictive model in use. This assumption is naive, because obtaining every class label requires a relatively high labeling budget. Methods that can train data stream classifiers from partially labeled data are therefore highly desirable. Among them, active learning [1] is a promising direction: it selects only the most valuable examples to be labeled and used to build an accurate predictive model. When designing such a system, however, we must ensure that the chosen active learning strategy can handle changes in the data distribution and adapt to them quickly. In this work, we focus on active learning strategies designed to tackle such changes effectively. We propose a novel active learning method for data stream classifiers based on the query-by-clustering approach. Experimental evaluation demonstrates the usefulness of the proposed approach for reducing the labeling cost of classifiers for drifting data streams.

[1] Mohamed Medhat Gaber, et al. Knowledge discovery from data streams, 2009, IDA 2009.

[2] Indre Zliobaite, et al. How good is the Electricity benchmark for evaluating concept drift adaptation, 2013, ArXiv.

[3] Mohamed Medhat Gaber, et al. Learning from Data Streams: Processing Techniques in Sensor Networks, 2007.

[4] Mykola Pechenizkiy, et al. An Overview of Concept Drift Applications, 2016.

[5] Ira Assent, et al. The ClusTree: indexing micro-clusters for anytime stream mining, 2011, Knowledge and Information Systems.

[6] Thomas Seidl, et al. An effective evaluation measure for clustering on evolving data streams, 2011, KDD.

[7] João Gama, et al. A survey on concept drift adaptation, 2014, ACM Comput. Surv.

[8] Denis J. Dean, et al. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables, 1999.

[9] Li Tu, et al. Density-based clustering for real-time stream data, 2007, KDD '07.

[10] André Carlos Ponce de Leon Ferreira de Carvalho, et al. MINAS: multiclass learning algorithm for novelty detection in data streams, 2016, Data Mining and Knowledge Discovery.

[11] Alexey Tsymbal, et al. The problem of concept drift: definitions and related work, 2004.

[12] Dino Ienco, et al. Clustering Based Active Learning for Evolving Data Streams, 2013, Discovery Science.

[13] Geoff Hulten, et al. Mining high-speed data streams, 2000, KDD '00.

[14] Jesús Alcalá-Fdez, et al. KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework, 2011, J. Multiple Valued Log. Soft Comput.

[15] Shonali Krishnaswamy, et al. Mining data streams: a review, 2005, SGMD.

[16] Philip S. Yu, et al. A Framework for Clustering Evolving Data Streams, 2003, VLDB.