An evolutionary-based approach for feature generation: Eukaryotic promoter recognition

Prediction of promoter regions remains a challenging subproblem in mapping out eukaryotic DNA. While this task is key to understanding the regulation of differential transcription, the gene-specific architecture of promoter sequences does not readily lend itself to general strategies. To date, the best approaches are based on Support Vector Machines (SVMs) that employ standard "spectrum" features and achieve promoter region classification accuracies ranging from 84% to 94%, depending on the species involved. In this paper, we propose a general and powerful methodology that uses Genetic Programming (GP) techniques to generate more complex and more gene-specific features to be used with a standard SVM for promoter region identification. We evaluate our methodology on three data sets from different species and observe consistent classification accuracies in the 94–95% range. In addition, because the GP-generated features are gene-specific, they can be used by biologists to advance their understanding of the architecture of eukaryotic promoter regions.
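As an illustration of the baseline "spectrum" features mentioned above, the following is a minimal sketch that assumes the usual meaning of the term: a vector counting how often each possible k-mer occurs in a sequence. The function name `spectrum_features`, the default `k=2`, and the example sequence are hypothetical choices made for this sketch and are not taken from the paper.

```python
from collections import Counter
from itertools import product


def spectrum_features(seq, k=2, alphabet="ACGT"):
    """Return the k-mer count vector of seq (hypothetical helper).

    The vector has one entry per possible k-mer over the alphabet,
    listed in lexicographic order (AA, AC, AG, AT, CA, ... for k=2).
    """
    seq = seq.upper()
    # Count every overlapping window of length k in the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fix the feature order so all sequences map to comparable vectors.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts[m] for m in kmers]


# Illustrative input only; not a real promoter sequence.
features = spectrum_features("ACGTAC", k=2)
```

Vectors of this kind, one per sequence, are what a standard SVM would take as input in a spectrum-based setup. The GP-generated features proposed in the abstract are more complex than these counts and are not reproduced here.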
