Evaluating Stream Classifiers with Delayed Labels Information

In general, data stream classifiers assume that the true label of every instance becomes available immediately after the classifier issues its prediction. The immediate availability of class labels allows supervised monitoring of the data distribution and of the error rate to verify whether the current classifier is outdated. Furthermore, if a change is detected, the classifier has access to all recent labeled data to update the model. However, this assumption is overly optimistic for most, if not all, applications. Given the cost and labor involved in obtaining labels, failures in data acquisition, or restrictions inherent to the classification problem, a more reasonable assumption is the delayed availability of class labels. In this paper, we experimentally analyze the impact of label latency on the performance of stream classifiers and call the community's attention to the need to consider this critical variable in the evaluation process. We also make suggestions to avoid biased conclusions caused by ignoring the delayed nature of stream problems. These are relevant contributions, since few studies consider this variable when proposing new algorithms. In addition, we propose a new evaluation measure, Kappa-Latency, that takes the arrival delay of the actual labels into account when evaluating and comparing a set of classifiers.
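
As an illustration only (not the paper's exact protocol), the sketch below shows one way to emulate delayed label arrival in a prequential, test-then-train loop and to score the resulting predictions with plain Cohen's Kappa. The paper's Kappa-Latency measure additionally accounts for the arrival delay itself, which is not reproduced here; the classifier interface (predict / partial_fit), the stream format, and the fixed delay value are assumptions made for the example.

```python
# Minimal sketch, assuming: a stream yielding (x, y_true) pairs, an incremental
# classifier with predict(x) and partial_fit(x, y), and a constant label delay.
# This is NOT the paper's Kappa-Latency definition, only a delayed evaluation loop
# scored with standard Cohen's Kappa.
from collections import Counter, deque

def delayed_prequential(stream, classifier, delay=100):
    """Test-then-train loop in which labels arrive `delay` instances late."""
    pending = deque()    # instances still waiting for their label
    evaluated = []       # (y_pred, y_true) pairs once the label has arrived

    for x, y_true in stream:
        y_pred = classifier.predict(x)      # predict before the label is known
        pending.append((x, y_true, y_pred))

        if len(pending) > delay:            # the oldest label "arrives" now
            x_old, y_old, y_old_pred = pending.popleft()
            evaluated.append((y_old_pred, y_old))
            classifier.partial_fit(x_old, y_old)   # train only after label arrival

    return evaluated    # labels of instances still in `pending` never arrived

def cohen_kappa(pairs):
    """Plain Cohen's Kappa over (prediction, true label) pairs."""
    n = len(pairs)
    if n == 0:
        return float("nan")
    pred_counts = Counter(p for p, _ in pairs)
    true_counts = Counter(t for _, t in pairs)
    p_o = sum(p == t for p, t in pairs) / n                       # observed agreement
    p_e = sum(pred_counts[c] * true_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
```

Under this protocol, larger delays postpone both error monitoring and model updates, which is exactly the effect the paper argues must be reflected in the evaluation of stream classifiers.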
