Large-scale data mining using genetics-based machine learning

We are living in the petabyte era. We have ever-larger data sets to analyze, process, and transform into useful answers for domain experts. Robust data mining tools that can cope with petascale volumes and/or high dimensionality while producing human-understandable solutions are key in several domain areas. Genetics-based machine learning (GBML) techniques are strong candidates for this task. Recent advances in representations, learning paradigms, and theoretical modelling have shown the competitiveness of non-evolutionary-computation (non-EC) techniques in handling large-scale data analysis. If evolutionary learning techniques aspire to be a relevant player in this context, they need the capacity to process these vast amounts of data, and they need to do so within reasonable time. Moreover, massive computation cycles are getting cheaper every day, giving researchers access to unprecedented computational resources on the edge of petascale computing. Several topics are interlaced in these two requirements, among them: (1) having the proper learning paradigms and knowledge representations, (2) understanding them and knowing when they are suitable for the problem at hand, (3) using efficiency enhancement techniques, and (4) transforming and visualizing the produced solutions to give back as much insight as possible to the domain experts.

This tutorial will try to shed light on these questions, following a roadmap that starts by exploring what large scale means and why large is both a challenge and an opportunity for GBML methods. As we will show, the opportunity has multiple facets: efficiency enhancement techniques, representations able to cope with high-dimensional spaces, scalability of learning paradigms, and alternative programming models, each helping to make GBML very attractive for large-scale data mining. Given these building blocks, we will then unfold how to model the scalability of the components of GBML systems, aiming at a better engineering effort that makes embracing large datasets routine. Finally, we will illustrate how all these ideas fit together by reviewing real applications of GBML systems and the further directions that will require serious consideration.
