Active Learning With Drifting Streaming Data

In learning to classify streaming data, obtaining true labels may require major effort and incur excessive cost. Active learning aims to select as few instances as possible for labeling while still learning an accurate predictive model. Streaming data poses additional challenges for active learning, since the data distribution may change over time (concept drift) and models need to adapt. Conventional active learning strategies query the most uncertain instances, which typically lie near the decision boundary. Changes that occur far from the boundary may therefore go unnoticed, and the model may fail to adapt. This paper presents a theoretically supported framework for active learning from drifting data streams and develops three active learning strategies for streaming data that explicitly handle concept drift. They are based on uncertainty, dynamic allocation of labeling efforts over time, and randomization of the search space. We empirically demonstrate that these strategies react well to changes that occur unexpectedly and anywhere in the instance space.
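To make the three ingredients named above concrete (uncertainty, dynamic allocation of labeling effort, and randomization), the following is a minimal sketch of one way such a query rule could look. It is an illustration under stated assumptions, not the paper's algorithm: the names `new_state` and `should_query`, the parameters `budget`, `step`, and `noise`, and their default values are all hypothetical, and `confidence` is assumed to be the classifier's maximum posterior probability for the incoming instance.

```python
import random


def new_state(threshold=1.0):
    """Bookkeeping for the stream (hypothetical helper, not from the paper)."""
    return {"seen": 0, "labeled": 0, "threshold": threshold}


def should_query(confidence, state, budget=0.1, step=0.01, noise=1.0, rng=random):
    """Decide whether to request the true label of one incoming instance.

    confidence: assumed to be the model's maximum posterior for the instance.
    budget:     assumed maximum fraction of instances that may be labeled.
    step:       assumed relative adjustment of the uncertainty threshold.
    noise:      assumed standard deviation of the multiplicative randomization.
    """
    state["seen"] += 1

    # Budget check: do not query while the labeled fraction is at or above budget.
    if state["labeled"] / state["seen"] >= budget:
        return False

    # Randomization: scale the threshold by a draw from N(1, noise) so that
    # confident instances far from the decision boundary are sometimes queried.
    randomized = state["threshold"] * rng.gauss(1.0, noise)

    if confidence < randomized:
        # Uncertain enough: spend a label and shrink the threshold, so labeling
        # effort is not used up too quickly in uncertain periods.
        state["labeled"] += 1
        state["threshold"] *= 1.0 - step
        return True

    # Confident: grow the threshold so that queries resume when the model
    # stays confident for a long time, which could hide a drift.
    state["threshold"] *= 1.0 + step
    return False
```

In use, the rule would be called once per arriving instance, and the model would be updated only when it returns `True`. The design choice in this sketch is that the budget check bounds the total labeling cost, while the adaptive threshold and the random factor spread queries over time and over the instance space.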
