Large scale data mining using genetics-based machine learning

We are living in the petabyte era. We have ever larger data sets to analyze, process, and transform into useful answers for domain experts. Robust data mining tools that can cope with petascale volumes and/or high dimensionality while producing human-understandable solutions are key in several domain areas. Genetics-based machine learning (GBML) techniques are perfect candidates for this task. Recent advances in representations, learning paradigms, and theoretical modeling have shown the competitiveness of non-EC techniques in handling large-scale data analysis. If evolutionary learning techniques aspire to be a relevant player in this context, they need to be able to process these vast amounts of data, and to do so within reasonable time. Moreover, massive computation cycles are getting cheaper every day, giving researchers access to unprecedented computational resources on the edge of petascale computing. Several topics are interlaced in these two requirements, among them: (1) having the proper learning paradigms and knowledge representations, (2) understanding them and knowing when they are suitable for the problem at hand, (3) using efficiency enhancement techniques, and (4) transforming and visualizing the produced solutions to give back as much insight as possible to the domain experts. This tutorial will try to shed light on these questions, following a roadmap that starts by exploring what large scale means and why it is both a challenge and an opportunity for GBML methods. As we will show, the opportunity has multiple facets: efficiency enhancement techniques, representations able to cope with high-dimensional spaces, scalability of learning paradigms, and alternative programming models, each of which helps make GBML very attractive for large-scale data mining.
Given these building blocks, we will then show how to model the scalability of the components of GBML systems, aiming at a better engineering effort that makes embracing large datasets routine. Finally, we will illustrate how all these ideas fit together by reviewing real applications of GBML systems and the further directions that will require serious consideration.
