A Two-Stage Evolutionary Approach for Effective Classification of Hypersensitive DNA Sequences

Hypersensitive (HS) sites in genomic sequences are reliable markers of DNA regulatory regions that control gene expression. Annotating regulatory regions is important for understanding phenotypic differences among cells and diseases linked to pathologies in protein expression. Several computational techniques map out regulatory regions in DNA by first identifying HS sequences. Statistical learning techniques such as Support Vector Machines (SVMs), for instance, are employed to classify DNA sequences as HS or non-HS. This paper proposes a method that automates the basic steps in designing an SVM and improves the accuracy of such classification. The method proceeds in two stages, each driven by an evolutionary algorithm. The first evolutionary algorithm designs sequence motifs that associate explicit, discriminating feature vectors with input DNA sequences. The second evolutionary algorithm then designs SVM kernel functions and parameters that best separate the HS and non-HS classes. Results show that this two-stage method significantly improves SVM classification accuracy. The method promises to be generally useful in automating the analysis of biological sequences, and we post its source code on our website.
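The two-stage idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data with a planted "GATA" motif, the simple elitist loop, and both fitness functions are assumptions made for illustration. In particular, stage 1 scores motifs by the gap between class-mean motif counts, and stage 2 scores an RBF kernel width by kernel-target alignment; both are stand-ins for the SVM classification accuracy that the paper's method would optimize.

```python
import math
import random

ALPHABET = "ACGT"

def count_occurrences(seq, motif):
    """Count (possibly overlapping) occurrences of motif in seq."""
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq[i:i + len(motif)] == motif)

def featurize(seq, motifs):
    """Map a DNA sequence to an explicit vector of motif counts."""
    return [count_occurrences(seq, m) for m in motifs]

def evolve(init, mutate, fitness, rng, generations=60, children=8):
    """Simple elitist loop: the parent is replaced only by a child that is at least as fit."""
    best, best_fit = init, fitness(init)
    for _ in range(generations):
        scored = [(fitness(k), k) for k in (mutate(best, rng) for _ in range(children))]
        top_fit, top = max(scored, key=lambda t: t[0])
        if top_fit >= best_fit:
            best, best_fit = top, top_fit
    return best, best_fit

def mutate_motifs(motifs, rng):
    """Point-mutate one random position of one random motif."""
    out = list(motifs)
    i = rng.randrange(len(out))
    j = rng.randrange(len(out[i]))
    out[i] = out[i][:j] + rng.choice(ALPHABET) + out[i][j + 1:]
    return out

def separation(motifs, hs, non_hs):
    """Stand-in stage-1 fitness: summed gap between class-mean motif counts."""
    def means(seqs):
        vecs = [featurize(s, motifs) for s in seqs]
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    return sum(abs(a - b) for a, b in zip(means(hs), means(non_hs)))

def alignment(log_gamma, X, y):
    """Stand-in stage-2 fitness: kernel-target alignment of an RBF kernel."""
    gamma = math.exp(log_gamma)
    n = len(X)
    num = sq = 0.0
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            k = math.exp(-gamma * d2)
            num += k * y[i] * y[j]
            sq += k * k
    return num / (math.sqrt(sq) * n)

# Toy data (an assumption): "HS" sequences carry a planted GATA motif.
rng = random.Random(0)

def rand_seq(n):
    return "".join(rng.choice(ALPHABET) for _ in range(n))

hs = [rand_seq(20) + "GATA" + rand_seq(20) for _ in range(15)]
non_hs = [rand_seq(44) for _ in range(15)]

# Stage 1: evolve a small set of motifs that define the feature vectors.
init_motifs = [rand_seq(4) for _ in range(3)]
motifs, sep = evolve(init_motifs, mutate_motifs,
                     lambda m: separation(m, hs, non_hs), rng)

# Stage 2: evolve the (log) RBF width on the stage-1 feature vectors.
X = [featurize(s, motifs) for s in hs + non_hs]
y = [1] * len(hs) + [-1] * len(non_hs)
log_gamma, align = evolve(0.0, lambda g, r: g + r.gauss(0, 0.5),
                          lambda g: alignment(g, X, y), rng, generations=30)

print("motifs:", motifs, "separation:", round(sep, 3))
print("gamma:", round(math.exp(log_gamma), 4), "alignment:", round(align, 3))
```

In a fuller version, each fitness call would presumably train an SVM and return its cross-validated accuracy, and stage 2 would search over kernel function forms as well as parameters, as the abstract describes.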
