EML: A Scalable, Transparent Meta-Learning Paradigm for Big Data Applications

The work presented in this chapter is motivated by two challenges that arise when applying machine learning (ML) techniques to big data applications: the scalability of an ML technique as the training data grows significantly in size, and the transparency (understandability) of the induced models. To address these issues, we describe and analyze a meta-learning paradigm, EML, that combines techniques from evolutionary computation and supervised learning to induce transparent models for big data ML applications.
