Incremental Ensemble Classifier Addressing Non-stationary Fast Data Streams

Classifying data points in a data stream poses a fundamentally different set of challenges from data mining on static data. While streaming data is often placed in the context of "Big Data" (or, more specifically, "Fast Data"), where one-pass algorithms are used, true data streams pose additional hurdles because they are dynamic, evolving, and non-stationary. During the stream, the set of available labels (or concepts) often changes, and a concept's definition in the feature space can also evolve (or drift) over time. The core issue is that the hidden generative function of the data is not constant but evolves over time; this is known as a non-stationary distribution. In this paper, we describe a new approach to using ensembles for stream classification. While the core method is straightforward, it is specifically designed to adapt quickly, and with very little overhead, to the dynamic and evolving nature of data streams generated from non-stationary functions. Our method, M3, is based on a weighted majority ensemble of heterogeneous model types in which model weights are updated online using reinforcement learning techniques. We compare our method with current leading algorithms as implemented in the Massive Online Analysis (MOA) framework, using UCI benchmark data sets and synthetic stream generator data sets, and find that our method shows particularly strong gains over the baseline method when only limited ground truth is available to the classifiers.
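The general idea of a weighted majority vote whose weights track recent rewards can be sketched as follows. This is an illustration under stated assumptions, not the M3 algorithm itself: the class name `WeightedMajorityEnsemble`, the `step_size` parameter, and the 0/1 reward with an exponential recency-weighted update are all hypothetical choices made for this sketch, and the ensemble members are plain Python callables standing in for heterogeneous models.

```python
from collections import defaultdict


class WeightedMajorityEnsemble:
    """Weighted majority vote over arbitrary models (illustrative sketch only)."""

    def __init__(self, models, step_size=0.1):
        # `models` are callables mapping a feature value to a label.
        # `step_size` is an assumed learning rate in (0, 1].
        self.models = list(models)
        self.step_size = step_size
        self.weights = [1.0] * len(self.models)

    def predict(self, x):
        # Each model votes for its predicted label with its current weight.
        votes = defaultdict(float)
        for model, weight in zip(self.models, self.weights):
            votes[model(x)] += weight
        return max(votes, key=votes.get)

    def update(self, x, y):
        # Called only when a true label y is available for x.
        # Reward is 1 for a correct model and 0 otherwise; each weight
        # moves toward its reward, so recent performance dominates.
        for i, model in enumerate(self.models):
            reward = 1.0 if model(x) == y else 0.0
            self.weights[i] += self.step_size * (reward - self.weights[i])


def above_half(x):
    # Hypothetical member model: predicts 1 when x > 0.5.
    return int(x > 0.5)


def below_half(x):
    # Hypothetical member model with the opposite decision rule.
    return int(x <= 0.5)
```

Because each weight moves toward the most recent rewards, a model that was reliable before a drift loses influence within a few labeled examples after it, and `update` can simply be skipped for unlabeled points when ground truth is scarce.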
