Online query by committee for active learning from drifting data streams

Most data stream learning methods assume that the true class of an incoming instance becomes available right after it has been processed. However, the assumption of unlimited access to class labels is unrealistic, as labeling typically carries a very high cost. This is the driving force behind the growing development of methods that require reduced or no access to class labels. Among several potential directions, active learning emerges as a promising solution, as it selects the most valuable instances from the stream and thus uses as few label queries as possible. Despite numerous proposals of active learning methods for static data, this domain is still developing for data streams. Here, the non-stationary nature of data must be taken into consideration, and the proposed algorithms must accommodate potential occurrences of concept drift. In this paper we propose a Query by Committee active learning strategy adapted to online learning from drifting data streams. The decision regarding a label query is made by an ensemble of classifiers instead of a single learner, leading to improved instance selection. We present four different approaches to online Query by Committee and evaluate their usefulness on the basis of the accuracy obtained under limited budgets and their ability to handle concept drift. We introduce Budget Loss of Accuracy, a novel measure for evaluating active learning algorithms. Finally, we investigate the relationships between the efficacy of Query by Committee models and the diversity of the underlying ensembles. Based on a thorough experimental investigation, we show the usefulness of the proposed algorithms for reducing labeling effort when learning from drifting data streams.
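The sketch below illustrates the general idea of disagreement-based Query by Committee on a stream under a labeling budget; it is not the paper's method or any of its four proposed variants. The committee members (scikit-learn SGDClassifier models kept diverse via Poisson-weighted, online-bagging-style updates), the vote-entropy disagreement measure, the synthetic drifting stream, and the budget and threshold values are all illustrative assumptions.

```python
# Minimal online Query by Committee sketch (illustrative assumptions only):
# query the label of a streaming instance when the committee disagrees and
# the labeling budget has not been exhausted.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
CLASSES = np.array([0, 1])

# Committee of incremental linear learners.
committee = [SGDClassifier(random_state=s) for s in range(5)]

def vote_entropy(votes):
    """Disagreement of the committee measured by the entropy of its votes."""
    proportions = np.bincount(votes, minlength=len(CLASSES)) / len(votes)
    nonzero = proportions[proportions > 0]
    return -np.sum(nonzero * np.log(nonzero))

def update_committee(X, y):
    """Online-bagging-style update: each member sees the labeled instance
    k ~ Poisson(1) times, which keeps the committee members diverse."""
    for member in committee:
        for _ in range(rng.poisson(1.0)):
            member.partial_fit(X, [y], classes=CLASSES)

def drifting_stream(n=2000, drift_at=1000):
    """Synthetic binary stream whose decision boundary changes at `drift_at`."""
    for t in range(n):
        x = rng.normal(size=2)
        w = np.array([1.0, 1.0]) if t < drift_at else np.array([1.0, -1.0])
        yield x, int(x @ w > 0)

budget = 0.2      # fraction of instances we are allowed to label
threshold = 0.3   # query when vote entropy exceeds this value
queried, seen = 0, 0

for x, y_true in drifting_stream():
    seen += 1
    X = x.reshape(1, -1)

    # Cold start: label the first instances and train every member once.
    if seen <= 10:
        queried += 1
        for member in committee:
            member.partial_fit(X, [y_true], classes=CLASSES)
        continue

    votes = np.array([int(member.predict(X)[0]) for member in committee])
    disagreement = vote_entropy(votes)

    # Query the true label only if the committee disagrees and budget remains.
    if disagreement > threshold and queried / seen < budget:
        queried += 1
        update_committee(X, y_true)

print(f"Labeled {queried} of {seen} instances ({queried / seen:.1%}).")
```

In this sketch the query decision depends on the whole committee rather than a single learner's confidence; other disagreement measures (e.g., margin of the vote) or budget-spending schemes could be substituted without changing the overall structure.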
