Detection of Eukaryotic Promoters Using Markov Transition Matrices

Eukaryotic promoters are among the most important functional domains that have yet to be satisfactorily characterized in genomic sequences. Most current detection methods rely on recognizing individual transcription elements using position-weight matrices (PWMs) or consensus sequences. Here, we study a simple promoter detection algorithm based on Markov transition matrices built from sequences upstream of proven transcription initiation sites. Its performance was evaluated on the training set and on a test set of promoter-containing sequences. The results on the training set are surprisingly good, given that the algorithm incorporates no specific knowledge about promoters. Yet the program exhibits the pathological behaviour typical of all training-set-based methods: a significant decline in performance when confronted with previously unseen sequences. Thus, the Markov algorithm, like the other methods presently available, does not truly capture the essence of eukaryotic promoters. A detection program based on a Markov model is likely to be blind to categories of promoters that have no close representatives in the training set.
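To make the idea of scoring with Markov transition matrices concrete, here is a minimal sketch in Python. It is not the program evaluated in this work. The abstract does not state the Markov order, the scoring rule, or the window length, so the first-order default, the add-one pseudocounts, the log-odds comparison against a second (background) matrix, and the sliding window are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from itertools import product

ALPHABET = "ACGT"


def transition_matrix(seqs, order=1, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences.

    Assumption: add-`pseudocount` smoothing so unseen transitions keep a
    non-zero probability.
    """
    contexts = ["".join(p) for p in product(ALPHABET, repeat=order)]
    counts = {c: {b: pseudocount for b in ALPHABET} for c in contexts}
    for s in seqs:
        s = s.upper()
        for i in range(order, len(s)):
            ctx, nxt = s[i - order:i], s[i]
            # Skip transitions that involve ambiguous symbols such as N.
            if ctx in counts and nxt in counts[ctx]:
                counts[ctx][nxt] += 1
    # Normalize each row of counts into a probability distribution.
    return {
        c: {b: n / sum(row.values()) for b, n in row.items()}
        for c, row in counts.items()
    }


def log_odds(seq, promoter, background, order=1):
    """Sum of log(P_promoter / P_background) over the transitions in `seq`.

    Assumption: a log-odds score against a background matrix; positive
    values mean the promoter matrix explains the sequence better.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        ctx, nxt = seq[i - order:i], seq[i]
        if ctx in promoter and nxt in promoter[ctx]:
            score += math.log(promoter[ctx][nxt] / background[ctx][nxt])
    return score


def scan(seq, promoter, background, window=10, order=1):
    """Score every window of length `window`; returns (start, score) pairs."""
    return [
        (i, log_odds(seq[i:i + window], promoter, background, order))
        for i in range(len(seq) - window + 1)
    ]
```

In such a setup, the promoter matrix would be trained on sequences upstream of known initiation sites and the background matrix on non-promoter sequence. Because every score is derived from transition frequencies observed in the training set, a promoter class with different composition would receive low scores, which is consistent with the decline on unseen sequences described above.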
