Prequential AUC for Classifier Evaluation and Drift Detection in Evolving Data Streams

Detecting and adapting to concept drift makes learning data stream classifiers a difficult task. It becomes even more complex when the distribution of classes in the stream is imbalanced. Proper assessment of classifiers for such data remains a challenge, as existing evaluation measures either do not take class imbalance into account or are unable to indicate changes in the class ratio over time. In this paper, we advocate the use of the area under the ROC curve (AUC) in imbalanced data stream settings and propose an efficient incremental algorithm that uses a sorted tree structure with a sliding window to compute AUC using constant time and memory. Additionally, we experimentally verify that this algorithm is capable of correctly evaluating classifiers on imbalanced streams and can be used as a basis for detecting changes in class definitions and imbalance ratio.
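The idea of computing AUC over a sliding window of scored examples can be illustrated with a minimal sketch. This is not the paper's implementation: the class name WindowedAUC, its methods, and the use of a plain sorted Python list in place of a balanced sorted tree are assumptions made for illustration, and the list-based insert and remove are linear in the window size. Ties between a positive and a negative score are counted in favor of the positive, which is also an assumption.

```python
from bisect import insort
from collections import deque


class WindowedAUC:
    """Sliding-window AUC sketch (hypothetical helper, not the paper's code).

    A sorted list stands in for the sorted tree structure; labels are 1
    (positive) or 0 (negative), scores are classifier confidences for class 1.
    """

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()    # (score, label) pairs in arrival order
        self.sorted_items = []   # the same pairs kept sorted by score
        self.n_pos = 0
        self.n_neg = 0

    def add(self, score, label):
        # Evict the oldest example once the window is full.
        if len(self.window) == self.window_size:
            old = self.window.popleft()
            self.sorted_items.remove(old)
            if old[1] == 1:
                self.n_pos -= 1
            else:
                self.n_neg -= 1
        item = (score, label)
        self.window.append(item)
        insort(self.sorted_items, item)
        if label == 1:
            self.n_pos += 1
        else:
            self.n_neg += 1

    def auc(self):
        # AUC is undefined when only one class is present in the window.
        if self.n_pos == 0 or self.n_neg == 0:
            return None
        pos_seen = 0
        correctly_ordered = 0
        # Walk from the highest score down; each negative contributes the
        # number of positives ranked above it.
        for _, label in reversed(self.sorted_items):
            if label == 1:
                pos_seen += 1
            else:
                correctly_ordered += pos_seen
        return correctly_ordered / (self.n_pos * self.n_neg)


evaluator = WindowedAUC(window_size=4)
for score, label in [(0.9, 1), (0.8, 0), (0.7, 1), (0.1, 0)]:
    evaluator.add(score, label)
print(evaluator.auc())  # 0.75
```

In a prequential (test-then-train) setting, each incoming example would first be scored by the classifier, passed to add, and only then used for training, so the windowed AUC tracks recent performance as the stream evolves.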
