Sketch-Based Naive Bayes Algorithms for Evolving Data Streams

A well-known learning task in big data stream mining is classification. Extensively studied in the offline setting, it remains a challenge in the streaming setting, where data are evolving and potentially infinite. In the offline setting, training stores all the data in memory; in the streaming setting this is impossible because of the massive amount of data generated in real time. To cope with these resource constraints, this paper proposes and analyzes several evolving naive Bayes classification algorithms based on the well-known count-min sketch, which minimizes the space needed to store the training data. The proposed algorithms also adopt concept drift approaches, such as ADWIN, to handle the fact that streaming data may evolve and change over time. However, handling sparse, very high-dimensional data in such a framework is highly challenging. We therefore include the hashing trick, a dimensionality reduction technique, to compress the feature space into a lower-dimensional one, which leads to a large memory saving. We give a theoretical analysis demonstrating that our proposed algorithms achieve accuracy comparable to classical big data stream mining algorithms while using a reasonable amount of resources. We validate these theoretical results through an extensive evaluation on both synthetic and real-world datasets.
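To make the core idea concrete, below is a minimal Python sketch, assuming one count-min sketch per class to hold feature counts and the hashing trick to bucket raw feature names into a fixed-size index space. The class names (`CountMinSketch`, `SketchNaiveBayes`), the `learn_one`/`predict_one` API, and all parameter defaults are illustrative assumptions, not the paper's implementation; drift handling (e.g., ADWIN-triggered resets) is omitted for brevity.

```python
# Illustrative sketch only: a streaming multinomial naive Bayes whose per-class
# feature counts are stored in count-min sketches, with the hashing trick
# mapping arbitrary feature names into a fixed number of buckets.
import math
import random
from collections import defaultdict


class CountMinSketch:
    """Approximate frequency counter; estimates may overcount, never undercount."""

    def __init__(self, width=1024, depth=4, seed=42):
        self.width = width
        self.depth = depth
        rng = random.Random(seed)
        # One salt per row gives `depth` different hash functions.
        self.salts = [rng.randrange(1 << 30) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        return hash((self.salts[row], key)) % self.width

    def update(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        return min(self.table[row][self._index(key, row)]
                   for row in range(self.depth))


class SketchNaiveBayes:
    """Streaming naive Bayes: class priors kept exactly, per-class feature
    counts kept in count-min sketches over hashed feature buckets."""

    def __init__(self, n_buckets=1024, width=1024, depth=4):
        self.n_buckets = n_buckets                # hashing-trick output dimension
        self.width, self.depth = width, depth     # CMS parameters
        self.class_counts = defaultdict(int)      # exact class priors (cheap)
        self.feature_totals = defaultdict(float)  # total feature mass per class
        self.sketches = {}                        # one CMS per class

    def _bucket(self, feature):
        # Hashing trick: map a feature name to one of n_buckets ids
        # (Python's hash is consistent within a single process run).
        return hash(feature) % self.n_buckets

    def learn_one(self, features, label):
        self.class_counts[label] += 1
        cms = self.sketches.setdefault(label, CountMinSketch(self.width, self.depth))
        for feature, value in features.items():
            cms.update(self._bucket(feature), value)
            self.feature_totals[label] += value

    def predict_one(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            cms = self.sketches[label]
            score = math.log(count / total)                      # log prior
            denom = self.feature_totals[label] + self.n_buckets  # Laplace smoothing
            for feature, value in features.items():
                est = cms.estimate(self._bucket(feature))
                score += value * math.log((est + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy usage on a tiny text-like stream.
nb = SketchNaiveBayes()
nb.learn_one({"cheap": 1, "pills": 1}, "spam")
nb.learn_one({"meeting": 1, "tomorrow": 1}, "ham")
print(nb.predict_one({"cheap": 1}))  # expected: "spam"
```

In this layout, memory per class is bounded by depth × width counters plus the prior, independent of the number of distinct raw features, which is the kind of space saving the abstract refers to; the price is the count-min sketch's one-sided overestimation of counts.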
