Data stream analysis: Foundations, major tasks and tools

The rapid growth of interconnected Internet-of-Things (IoT) devices and of social network usage, along with the evolution of technology in different domains, has led to a rise in the volume of data generated continuously by multiple systems. Valuable information can be derived from these evolving data streams by applying machine learning. In practice, several critical issues emerge when extracting useful knowledge from these potentially infinite data, mainly because of their evolving nature and high arrival rate, which make it impossible to store them entirely. In this work, we provide a comprehensive survey that discusses the research constraints and the current state of the art in this vibrant field. Moreover, we present an updated overview of the latest contributions proposed for different stream mining tasks, particularly classification, regression, clustering, and frequent pattern mining.
