Using evolutionary computation to improve SVM classification

Support vector machines (SVMs) are now one of the most popular machine learning techniques for solving difficult classification problems. Their effectiveness depends on two critical design decisions: 1) mapping a decision problem into an n-dimensional feature space, and 2) choosing a kernel function that maps the n-dimensional feature space into a higher-dimensional and more effective classification space. The choice of kernel function is generally limited to a small set of well-studied candidates. The choice of a feature set, however, is much more open-ended and comes with little design guidance. In fact, many SVMs are designed with standard generic feature space mappings embedded a priori. In this paper we describe a procedure for using an evolutionary algorithm to design more compact non-standard feature mappings that, for a fixed kernel function, significantly improve the classification accuracy of the constructed SVM.
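
The abstract does not specify the encoding, the genetic operators, or the fitness function, so the following is only a minimal sketch of the general idea of evolving a compact feature mapping while the classifier is held fixed. It assumes a bit-mask encoding (one bit per candidate feature), tournament selection, one-point crossover, bit-flip mutation, and a small per-feature penalty to favor compact mappings. To stay dependency-free, a nearest-centroid classifier stands in for the fixed-kernel SVM; in practice the fitness would come from training and validating an SVM on the selected features. All names (`make_data`, `fitness`, `evolve`) and all parameter values are hypothetical.

```python
import random


def make_data(rng, n=60, n_informative=2, n_noise=6):
    """Synthetic two-class data: a few informative columns plus noise columns."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        informative = [rng.gauss(2.0 * label, 0.5) for _ in range(n_informative)]
        noise = [rng.gauss(0.0, 3.0) for _ in range(n_noise)]
        X.append(informative + noise)
        y.append(label)
    return X, y


def fitness(mask, X, y, penalty=0.01):
    """Training accuracy of a nearest-centroid classifier on the selected
    columns, minus a penalty per selected feature (assumed stand-in for
    the accuracy of a fixed-kernel SVM)."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature set cannot classify anything
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, t in zip(X, y):
        dist = {
            c: sum((x[j] - m) ** 2 for j, m in zip(cols, centroids[c]))
            for c in centroids
        }
        if min(dist, key=dist.get) == t:
            correct += 1
    return correct / len(y) - penalty * len(cols)


def evolve(X, y, rng, pop_size=20, generations=30, mutation_rate=0.1):
    """Evolve feature masks; returns the best mask and the best fitness
    recorded after each generation."""
    n = len(X[0])
    score = lambda m: fitness(m, X, y)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(pop, key=score)
    history = [score(best)]
    for _ in range(generations):
        new_pop = [best[:]]  # elitism: the best mask always survives
        while len(new_pop) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            p1 = max(rng.sample(pop, 3), key=score)
            p2 = max(rng.sample(pop, 3), key=score)
            cut = rng.randrange(1, n)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            new_pop.append(child)
        pop = new_pop
        best = max(pop, key=score)
        history.append(score(best))
    return best, history


if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_data(rng)
    best, history = evolve(X, y, rng)
    print("selected feature mask:", best)
    print("best fitness:", history[-1])
```

Because the best mask is copied unchanged into each new generation, the recorded best fitness never decreases. The per-feature penalty is one simple way to express the preference for compact mappings; the paper may well use a different mechanism.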

[1]  M. L. Howard,et al.  cis-Regulatory control circuits in development. , 2004, Developmental biology.

[2]  S. Salzberg,et al.  GeneSplicer: a new computational method for splice site prediction. , 2001, Nucleic acids research.

[3]  G. Felsenfeld,et al.  Chromatin Unfolds , 1996, Cell.

[4]  Xizhao Wang,et al.  Optimization of combined kernel function for SVM based on large margin learning theory , 2008, 2008 IEEE International Conference on Systems, Man and Cybernetics.

[5]  Chih-Jen Lin,et al.  Working Set Selection Using Second Order Information for Training Support Vector Machines , 2005, J. Mach. Learn. Res..

[6]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[7]  Lise Getoor,et al.  SplicePort—An interactive splice-site analysis tool , 2007, Nucleic Acids Res..

[8]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[9]  Andreas Prlic,et al.  Sequence analysis , 2003 .

[10]  Rodrigo Lopez,et al.  Clustal W and Clustal X version 2.0 , 2007, Bioinform..

[11]  Nomenclature Committee of the International Union of Biochemistry (NC-IUB). Nomenclature of electron-transfer proteins. Recommendations 1989. , 1992, The Journal of biological chemistry.

[12]  Patrick Wincker,et al.  Large-scale gene discovery in the pea aphid Acyrthosiphon pisum (Hemiptera) , 2006, Genome Biology.

[13]  Ron Kohavi,et al.  Wrappers for Feature Subset Selection , 1997, Artif. Intell..

[14]  Keith Vertanen,et al.  Genetic Adventures in Parallel : Towards a Good Island Model under PVM , 2004 .

[15]  E. Davidson Genomic Regulatory Systems: Development and Evolution , 2005 .

[16]  R Staden Computer methods to locate signals in nucleic acid sequences , 1984, Nucleic Acids Res..

[17]  J. Yang,et al.  Near-optimal feature selection for large databases , 2009, J. Oper. Res. Soc..

[18]  E. Birney,et al.  EGASP: the human ENCODE Genome Annotation Assessment Project , 2006, Genome Biology.

[19]  Lise Getoor,et al.  Features generated for computational splice-site prediction correspond to functional elements , 2007, BMC Bioinformatics.

[20]  Jihoon Yang,et al.  Feature Subset Selection Using a Genetic Algorithm , 1998, IEEE Intell. Syst..

[21]  J. Stamatoyannopoulos,et al.  NF‐E2 and GATA binding motifs are required for the formation of DNase I hypersensitive site 4 of the human beta‐globin locus control region. , 1995, The EMBO journal.

[22]  Erick Cantú-Paz,et al.  Feature Subset Selection, Class Separability, and Genetic Algorithms , 2004, GECCO.

[23]  Kenneth A. De Jong,et al.  Genetic algorithms as a tool for feature selection in machine learning , 1992, Proceedings Fourth International Conference on Tools with Artificial Intelligence TAI '92.

[24]  William Stafford Noble,et al.  Predicting the in vivo signature of human gene regulatory sequence , 2005, ISMB.

[25]  Portland Press Ltd Nomenclature Committee for the International Union of Biochemistry (NC-IUB). Nomenclature for incompletely specified bases in nucleic acid sequences. Recommendations 1984. , 1985, Molecular biology and evolution.

[26]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[27]  Chaoyang Zhang,et al.  Supervised learning method for the prediction of subcellular localization of proteins using amino acid and amino acid pair composition , 2008, BMC Genomics.

[28]  Ming-Zhu Lu,et al.  Optimization of combined kernel function for SVM by Particle Swarm Optimization , 2009, 2009 International Conference on Machine Learning and Cybernetics.

[29]  J. Stamatoyannopoulos,et al.  Genome-wide identification of DNaseI hypersensitive sites using active chromatin sequence libraries. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[30]  Eleazar Eskin,et al.  The Spectrum Kernel: A String Kernel for SVM Protein Classification , 2001, Pacific Symposium on Biocomputing.

[31]  Jason Weston,et al.  Mismatch String Kernels for SVM Protein Classification , 2002, NIPS.

[32]  Lise Getoor,et al.  A Feature Generation Algorithm with Applications to Bio- logical Sequence Classification , 2007 .

[33]  Dong Seong Kim,et al.  Determining Optimal Decision Model for Support Vector Machine by Genetic Algorithm , 2004, CIS.

[34]  J. Hanley,et al.  The meaning and use of the area under a receiver operating characteristic (ROC) curve. , 1982, Radiology.

[35]  Michael Q. Zhang,et al.  A weight array method for splicing signal analysis , 1993, Comput. Appl. Biosci..

[36]  Christopher B. Burge,et al.  Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals , 2003, RECOMB '03.