Sequence alignment kernel for recognition of promoter regions

In this paper we propose a new method for recognising prokaryotic promoter regions with known transcription startpoints. The method is based on the Sequence Alignment Kernel, a function that gives a quantitative measure of the match between two sequences. This kernel function is then used in a dual SVM, which performs the recognition. Several recognition methods were trained and tested on a positive data set of 669 sigma70-promoter regions of Escherichia coli with known transcription startpoints, and on two negative data sets of 709 examples each, taken from coding and non-coding regions of the same genome. Our method achieves an average error rate of 16.5% on the positive and coding negative data, and 18.6% on the positive and non-coding negative data.

Availability: A demo version of our method is accessible from our website, http://mendel.cs.rhul.ac.uk/
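As a rough illustration of the idea of scoring the match between two sequences and passing the pairwise scores to a kernel method, the sketch below computes a plain global alignment score (Needleman-Wunsch style) and a matrix of pairwise scores. It is not the Sequence Alignment Kernel of the paper, whose definition the abstract does not give. The function names and the scoring parameters (`match`, `mismatch`, `gap`) are assumptions chosen for illustration, and raw alignment scores are not guaranteed to form a valid (positive semidefinite) kernel.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of strings a and b (illustrative scoring values)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def score_matrix(seqs):
    """Pairwise alignment scores, laid out like a Gram matrix (hypothetical helper)."""
    return [[alignment_score(s, t) for t in seqs] for s in seqs]


if __name__ == "__main__":
    # Toy sequences, not taken from the paper's data sets.
    toy = ["TTGACA", "TTGACT", "TATAAT"]
    for row in score_matrix(toy):
        print(row)
```

A matrix of this shape is what a dual-form SVM works with, since the dual formulation only needs pairwise kernel values between examples rather than explicit feature vectors.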
