Evolving GP Classifiers for Streaming Data Tasks with Concept Change and Label Budgets: A Benchmarking Study

Streaming data classification raises several challenges that are not typically encountered in offline supervised learning. Specifically, at any training generation only a small subset of the data is accessible, and the data itself is potentially generated by a non-stationary process. Moreover, requesting labels carries a cost, so a label budget is enforced. Finally, an anytime classification requirement implies that it must be possible to identify a ‘champion’ classifier for predicting labels as the stream progresses. In this work, we propose a general framework for deploying genetic programming (GP) to streaming data classification under these constraints. The framework consists of a sampling policy and an archiving policy, which together enforce the criteria for selecting the data that appear in a data subset. Only the exemplars of the data subset are labeled, and training epochs are performed against the content of that subset alone. Specific recommendations include support for GP task decomposition/modularity and performing additional training epochs per data subset. Both recommendations significantly improve on the baseline performance of GP under streaming data with label budgets. Benchmarking issues addressed include the identification of datasets and performance measures.
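The interaction between the sampling policy, the archiving policy, the label budget, and the champion classifier can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `run_stream`, its parameters, uniform sampling at the budget rate, oldest-first replacement in the data subset, and non-overlapping windows to trigger training. The "population" is a set of scalar threshold classifiers standing in for GP programs, so the code shows only the control flow, not the authors' method.

```python
import random
from collections import deque


def run_stream(stream, label_oracle, budget=0.05, subset_size=20,
               window=100, epochs_per_subset=3, seed=0):
    """Toy loop: sample -> request label -> archive -> train -> pick champion."""
    rng = random.Random(seed)
    # Archiving policy (assumed): a full subset drops its oldest exemplar.
    subset = deque(maxlen=subset_size)
    # Stand-in population: scalar thresholds, NOT real GP individuals.
    population = [rng.uniform(-1.0, 1.0) for _ in range(10)]
    champion = population[0]
    labels_requested = 0
    predictions = []

    def accuracy(threshold):
        # Fitness is measured only on the labeled data subset.
        return sum((x > threshold) == y for x, y in subset) / len(subset)

    for t, x in enumerate(stream):
        # Anytime requirement: the current champion predicts every exemplar.
        predictions.append(x > champion)
        # Sampling policy (assumed): uniform sampling at the budget rate.
        if rng.random() < budget:
            subset.append((x, label_oracle(t)))  # the only place labels are paid for
            labels_requested += 1
        # Train at the end of each non-overlapping window (assumed schedule).
        if (t + 1) % window == 0 and subset:
            for _ in range(epochs_per_subset):  # several epochs per data subset
                population.sort(key=accuracy, reverse=True)
                parents = population[:len(population) // 2]
                children = [p + rng.gauss(0.0, 0.1) for p in parents]
                population = parents + children
            champion = max(population, key=accuracy)
    return predictions, labels_requested


# Usage: a synthetic stream whose concept (decision threshold) shifts halfway.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(1000)]


def true_label(t):
    return xs[t] > (0.0 if t < 500 else 0.5)


predictions, labels_requested = run_stream(xs, true_label)
print(f"labels requested: {labels_requested} of {len(xs)}")
```

Raising `epochs_per_subset` in this sketch mirrors the recommendation to perform additional training epochs per data subset: more search is done per labeled exemplar without requesting any further labels.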
