Challenges in benchmarking stream learning algorithms with real-world data

Streaming data are increasingly present in real-world applications such as sensor measurements, satellite data feeds, stock markets, and financial data. The main characteristics of these applications are the online arrival of observations at high speed and their susceptibility to changes in the data distribution due to the dynamic nature of real environments. The data stream mining community still faces primary challenges in comparing and evaluating new proposals, mainly due to the lack of publicly available, high-quality, non-stationary real-world datasets. Comparing the stream algorithms proposed in the literature is not easy, as authors do not always follow the same recommendations, experimental evaluation procedures, datasets, and assumptions. In this paper, we mitigate problems related to the choice of datasets in the experimental evaluation of stream classifiers and drift detectors. To that end, we propose a new public data repository for benchmarking stream algorithms with real-world data. This repository contains the most popular datasets from the literature and new datasets related to a highly relevant public health problem: the recognition of disease vector insects using optical sensors. The main advantage of these new datasets is that their characteristics and patterns of change are known in advance, which allows new adaptive algorithms to be evaluated adequately. We also present an in-depth discussion of the characteristics, reasons, and issues that lead to different types of changes in data distribution, as well as a critical review of common problems with the benchmark datasets currently available in the literature.
