Handling Concept Drift in a Text Data Stream Constrained by High Labelling Cost

In many real-world classification problems the concept being modelled is not static but changes over time, a situation known as concept drift. Most techniques for handling concept drift rely on the true classifications of test instances becoming available shortly after classification, so that classifiers can be retrained to handle the drift. However, in applications where labelling instances with their true class is costly, this assumption is not reasonable. In this paper we present an approach for keeping a classifier up to date in a concept drift domain that is constrained by a high cost of labelling. We use an active-learning-style approach to select for labelling those examples that are most useful in handling changes in concept. We show how this approach can adequately handle concept drift in a text filtering scenario while requiring just 15% of the documents to be manually categorised and labelled.
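
To illustrate the general idea of choosing which examples to label under a fixed budget, the following is a minimal sketch only. The abstract does not state the selection criterion used in the paper, so the sketch assumes a common active learning heuristic: request labels for the documents whose classifier score lies closest to the decision boundary. The function name `select_for_labelling`, the `budget_percent` parameter, and the example scores are all hypothetical.

```python
def select_for_labelling(scores, budget_percent=15):
    """Pick the instances to send for manual labelling.

    ASSUMPTION: `scores` are signed classifier outputs where 0 is the
    decision boundary (for example, SVM margins). This uncertainty
    heuristic is an illustration, not the paper's stated method.
    """
    if not 0 <= budget_percent <= 100:
        raise ValueError("budget_percent must be between 0 and 100")
    # Integer arithmetic keeps the number of requests within the budget.
    n_to_label = len(scores) * budget_percent // 100
    # Rank by |score|: a small magnitude means the classifier is least certain.
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]))
    return sorted(ranked[:n_to_label])


# Hypothetical signed scores for a batch of 20 incoming documents.
batch_scores = [2.1, -0.1, 1.7, 0.05, -1.9, 0.8, -0.3, 1.2, -2.4, 0.6,
                -0.9, 1.5, 0.2, -1.1, 2.8, -0.7, 1.0, -1.6, 0.4, -2.0]
to_label = select_for_labelling(batch_scores)
print(to_label)
```

With a 15% budget and a batch of 20 documents, three documents are selected for labelling; the remaining 17 are classified without requesting their true class. In a drifting stream, the newly labelled documents would then be used to update the classifier before the next batch arrives.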
