Dynamic Correlation-Based Feature Selection for Feature Drifts in Data Streams

Learning from data streams requires efficient algorithms that update a model incrementally as new instances arrive. Such learners must respond quickly, in real time, and above all must adapt to changes in the data distribution, a phenomenon known as concept drift. However, recent work has shown that changes in the subset of relevant features over time, called feature drift, can have a significant impact on the learning process, even though they have so far been largely disregarded in the treatment of data streams. To improve the classification of feature-drifting data streams, we present DCFS (Dynamic Correlation-based Feature Selection), an algorithm that determines which features are most important at each moment of a data stream. Driven by an adaptive strategy based on a drift monitor, a correlation-based feature selection method dynamically updates the relevant feature subset. Experimental results show that embedding our feature selection algorithm in an incremental, online classifier yields good performance on data streams with feature drift, in some cases surpassing state-of-the-art data stream classifiers.
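
To make the idea of correlation-based feature selection concrete, the following minimal sketch computes the merit score commonly associated with Hall's CFS, k * r_cf / sqrt(k + k(k-1) * r_ff), and runs a greedy forward search over it. It is an illustration only, not the DCFS algorithm itself: the function names `cfs_merit` and `greedy_cfs`, the use of Pearson correlation, and the batch (non-streaming) setting are all assumptions made here for brevity.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit of the features indexed by `subset` (hypothetical helper)."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean absolute feature-class correlation (Pearson, an assumption here).
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    # Mean absolute feature-feature correlation over distinct pairs.
    if k == 1:
        r_ff = 0.0
    else:
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                        for i, a in enumerate(subset)
                        for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y):
    """Forward selection: keep adding the feature that most improves merit."""
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        score, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if score <= best:
            break  # no candidate improves the merit; stop
        selected.append(j)
        remaining.remove(j)
        best = score
    return selected
```

In a streaming setting, one plausible design (again an assumption, not something the abstract specifies) is to rerun such a selection on recent data whenever the drift monitor signals a change, and to retrain or mask the classifier's inputs accordingly.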
