The PerfSim Algorithm for Concept Drift Detection in Imbalanced Data

There is currently a surge of interest in adaptive learning algorithms for applications ranging from predicting ozone level peaks and learning stock market indicators to detecting smartphone usage patterns. In such scenarios, the detection of change (or drift) in the concept being learned is important to ensure that correct, timely, and relevant models are constructed. In addition, such data is often imbalanced and, to further complicate the issue, we are frequently interested in learning the minority class. It follows that ignoring these two aspects during learning may lead to unreliable, or even incorrect, models. In this research we discuss the interplay between concept drift detection and imbalanced data sets in order to ensure reliable results. We introduce a novel algorithm, PerfSim, that, rather than relying on a single performance measure such as accuracy for change detection, considers all the components of the confusion matrix and employs the cosine similarity coefficient. We evaluate our algorithm on a real-world mobile phone database as well as benchmark data sets, and compare it against two other state-of-the-art methods. The results show that our approach detects concept drift reliably and is particularly sensitive to drifts occurring in imbalanced data sets. Further, our method performs very well compared to the other techniques, especially when the drift occurs in the minority class of a class imbalance problem.
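To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of drift detection based on the cosine similarity of confusion-matrix components between a reference window and the current window. The function names, the windowing scheme, and the similarity threshold are assumptions introduced for illustration only.

```python
import numpy as np

def confusion_vector(y_true, y_pred):
    """Flatten the binary confusion matrix (TP, FP, FN, TN) into a vector."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return np.array([tp, fp, fn, tn], dtype=float)

def cosine_similarity(a, b):
    """Cosine similarity coefficient between two component vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def drift_detected(ref_window, cur_window, threshold=0.99):
    """Flag drift when the confusion-matrix profile of the current window
    diverges from the reference window (threshold is an assumed value)."""
    ref_vec = confusion_vector(*ref_window)
    cur_vec = confusion_vector(*cur_window)
    return cosine_similarity(ref_vec, cur_vec) < threshold
```

Because all four confusion-matrix components contribute to the vector, a shift confined to the minority class (e.g., a rise in false negatives) changes the vector's direction even when overall accuracy barely moves, which is the intuition behind comparing full performance profiles rather than a single scalar measure.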
