Delayed labelling evaluation for data streams

Much of the stream mining literature on classification assumes that true labels become available immediately after predictions are made. This assumption is exemplified by test-then-train evaluation, in which each prediction is immediately followed by the arrival of the true label. In many real scenarios, however, labels arrive with non-negligible latency. This raises the question of how to evaluate classifiers trained under such conditions, a question that is particularly important when stream mining models are expected to refine their predictions between acquiring an instance and receiving its true label. In this work, we propose continuous re-evaluation, a novel evaluation methodology for data streams subject to verification latency. We apply it to reference data streams and use it to differentiate between stream mining techniques in terms of their ability to refine predictions based on newly arriving instances. Our study discusses and shows empirically the importance of considering the delay of instance labels when evaluating classifiers for data streams.
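To make the contrast with test-then-train concrete, the following is a minimal sketch of evaluation under delayed labels. It is not the procedure defined in the paper, whose details are not given in this abstract. It assumes a fixed label delay measured in stream steps (at least 1), re-predicts each pending instance only once, at the moment its label arrives, and uses a toy majority-class learner. The names `MajorityClass` and `evaluate_with_delay` are hypothetical and chosen for illustration only.

```python
from collections import Counter, deque


class MajorityClass:
    """Toy incremental learner (illustrative assumption, not from the paper)."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # No labels seen yet: abstain by returning None.
        if not self.counts:
            return None
        # Ties are broken in favour of the label seen first.
        return max(self.counts, key=self.counts.get)

    def learn(self, x, y):
        self.counts[y] += 1


def evaluate_with_delay(stream, model, delay):
    """Return (first_accuracy, last_accuracy) over instances whose label arrived.

    first_accuracy scores the prediction made when the instance arrived;
    last_accuracy scores a re-prediction made just before its label is revealed.
    Instances still waiting for a label when the stream ends are not scored.
    """
    pending = deque()  # entries: (arrival_step, x, y, first_prediction)
    first_hits = last_hits = scored = 0
    for t, (x, y) in enumerate(stream):
        # Reveal every label whose delay has elapsed by step t.
        while pending and pending[0][0] + delay <= t:
            _, px, py, first = pending.popleft()
            last = model.predict(px)  # refined prediction from the current model
            first_hits += int(first == py)
            last_hits += int(last == py)
            scored += 1
            model.learn(px, py)  # train only once the label is available
        pending.append((t, x, y, model.predict(x)))
    if scored == 0:
        return 0.0, 0.0
    return first_hits / scored, last_hits / scored


# Hypothetical stream: features are unused by the toy learner.
stream = [(None, label) for label in "abbbbb"]
first_acc, last_acc = evaluate_with_delay(stream, MajorityClass(), delay=2)
print(first_acc, last_acc)
```

With a delay of one step the model does not change between the first prediction and the re-prediction, so the two accuracies coincide, as in test-then-train. A gap between them can appear only when other labels arrive while an instance is still waiting for its own.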
