ExSTraCS 2.0: description and evaluation of a scalable learning classifier system

Algorithmic scalability is a major concern for any machine learning strategy in this age of ‘big data’. A large number of potentially predictive attributes is emblematic of problems in bioinformatics, genetic epidemiology, and many other fields. Previously, ExSTraCS was introduced as an extended Michigan-style supervised learning classifier system that combined a set of powerful heuristics to tackle the challenges of classification, prediction, and knowledge discovery in complex, noisy, and heterogeneous problem domains. While Michigan-style learning classifier systems are powerful and flexible learners, they are not considered to be particularly scalable. This paper presents, for the first time, a complete description of the ExSTraCS algorithm and introduces an effective strategy to dramatically improve learning classifier system scalability. ExSTraCS 2.0 addresses scalability with (1) a rule specificity limit, (2) new approaches to expert-knowledge-guided covering and mutation, and (3) the implementation and utilization of the TuRF algorithm for improving the quality of expert knowledge discovery in larger datasets. Performance over a complex spectrum of simulated genetic datasets demonstrated that these new mechanisms dramatically improved nearly every performance metric on 20-attribute datasets and made it possible for ExSTraCS to scale reliably to related 200- and 2000-attribute datasets. ExSTraCS 2.0 also reliably solved the 6-, 11-, 20-, 37-, 70-, and 135-bit multiplexer problems, doing so in similar or fewer learning iterations than previously reported, with smaller finite training sets, and without using building blocks discovered from simpler multiplexer problems. Furthermore, ExSTraCS usability was simplified by eliminating previously critical run parameters.
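To make the interplay of the rule specificity limit and expert-knowledge-guided covering concrete, the Python sketch below illustrates one plausible way such a covering operator could work: a newly covered rule specifies at most a fixed number of attributes (the rule specificity limit), and the attributes to specify are sampled in proportion to precomputed expert knowledge scores such as those produced by ReliefF or TuRF. This is a minimal illustration under stated assumptions, not the published ExSTraCS implementation; the names `ek_guided_covering`, `expert_scores`, and `rsl` are hypothetical.

```python
import random

def ek_guided_covering(instance_state, instance_class, expert_scores, rsl):
    """Hypothetical sketch of expert-knowledge (EK) guided covering under a
    rule specificity limit (RSL): the new rule specifies at most `rsl`
    attributes, chosen with probability proportional to their EK scores,
    and copies the matching attribute values from the training instance."""
    num_attributes = len(instance_state)
    # Draw how many attributes this rule will specify, capped by the RSL.
    num_to_specify = random.randint(1, min(rsl, num_attributes))

    # Weighted sampling without replacement, proportional to EK scores.
    candidates = list(range(num_attributes))
    weights = [max(float(expert_scores[a]), 0.0) for a in candidates]
    specified = []
    for _ in range(num_to_specify):
        total = sum(weights)
        if total <= 0.0:
            idx = random.randrange(len(candidates))  # fall back to uniform choice
        else:
            r, acc, idx = random.uniform(0.0, total), 0.0, 0
            for i, w in enumerate(weights):
                acc += w
                if r <= acc:
                    idx = i
                    break
        specified.append(candidates.pop(idx))
        weights.pop(idx)

    # The covered rule: specified attributes take the instance's values,
    # all other attributes remain "don't care"; the class is the instance class.
    condition = {a: instance_state[a] for a in specified}
    return {"condition": condition, "class": instance_class}


# Toy usage: a 20-attribute SNP-style instance where the first two attributes
# carry most of the (hypothetical) expert knowledge signal.
state = [random.randint(0, 2) for _ in range(20)]
scores = [0.9, 0.8] + [0.05] * 18  # e.g., ReliefF/TuRF attribute scores
rule = ek_guided_covering(state, 1, scores, rsl=5)
print(rule)
```

The key point the sketch conveys is that the rule specificity limit bounds how specific any covered rule can be regardless of attribute count, while the expert knowledge weighting biases the limited specification budget toward attributes that filter algorithms rank as most informative.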
