Machine learning for streaming data: state of the art, challenges, and opportunities

Incremental learning, online learning, and data stream learning are terms commonly associated with learning algorithms that update their models from a continuous influx of data without performing multiple passes over the data. Several works have been devoted to this area, either directly or indirectly through the Velocity and Volume characteristics of big data processing. Given current industry needs, many challenges must be addressed before existing methods can be applied efficiently to real-world problems. In this work, we focus on elucidating the connections among the current state of the art in related fields and on clarifying open challenges in both academia and industry. We treat with special care topics that were not thoroughly investigated in past position and survey papers. This work aims to evoke discussion and elucidate current research opportunities, highlighting the relationships among different subareas and suggesting courses of action when possible.
