Many different approaches have been employed to define the "consensus" sequence of various DNA binding sites and to use the resulting definition to locate and rank members of a given sequence family. The analysis presented here enlists two of these approaches, each in modified form, to develop a highly efficient search protocol for Escherichia coli promoters and to provide a relative ranking of these sites that shows good agreement with in vitro measurements of promoter strength. Schneider et al. have applied Shannon's index of information content to evaluate the significance of each position within the consensus of a family of aligned sequences. In a formal sense, this index is applicable only to a group of sequences; for sequences in which all bases are equally represented, it provides at each position a negative entropy value between zero (random) and two bits (total conservation of a single base). A method is described for evaluating how well an individual sequence conforms to the information content pattern of the consensus. A function is derived, by analogy to the information content of the sequence family, for application to individual sequences. Since this function is a measure of conformity, it can be used in a search protocol to identify new members of the family represented by the consensus. A protocol for locating E. coli promoters is presented. The Berg-von Hippel statistical-mechanical function is also tested in a similar application. While the information content function provides a superior search protocol, the Berg-von Hippel function, when scaled at each position by the information content, does well at ranking promoters according to their strength as measured in vitro.
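As a rough illustration of the per-position index described above, the following minimal sketch computes information content as two bits minus the Shannon entropy of each aligned column, assuming equiprobable background bases and omitting any small-sample correction. The function names and the toy alignment are hypothetical, and `conformity_score` is only an illustrative stand-in (observed-base frequency weighted by column information), not the function derived in the paper.

```python
import math
from collections import Counter

BASES = "ACGT"

def base_frequencies(aligned):
    """Per-column base frequencies for equal-length aligned sequences."""
    length = len(aligned[0])
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned)
        total = sum(counts[b] for b in BASES)
        freqs.append({b: counts[b] / total for b in BASES})
    return freqs

def position_information(freqs):
    """Information content in bits per column: 2 - H, with H the Shannon entropy.

    Assumes all four bases are equally likely in the background, so the
    result runs from 0 (random column) to 2 (one base fully conserved).
    """
    info = []
    for col in freqs:
        entropy = -sum(f * math.log2(f) for f in col.values() if f > 0)
        info.append(2.0 - entropy)
    return info

def conformity_score(candidate, freqs, info):
    """Illustrative stand-in, NOT the paper's function: sum over positions of
    (column information) x (family frequency of the candidate's base)."""
    return sum(r * col.get(base, 0.0)
               for base, col, r in zip(candidate, freqs, info))

# Toy alignment (hypothetical data, for illustration only).
family = ["ACGT", "AAGT", "ATGT", "AGGT"]
freqs = base_frequencies(family)
info = position_information(freqs)
score = conformity_score("ACGT", freqs, info)
```

In this toy family, columns with a single conserved base carry the full two bits and the uniformly mixed column carries none, so a candidate is rewarded only for matching the conserved positions. A search protocol along these lines would slide a window over a longer sequence and keep the windows whose score exceeds a chosen threshold.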
[1] R. Bambara, et al. On the statistical significance of primary structural features found in DNA-protein interaction sites. Nucleic Acids Research, 1975.
[2] G. Studnicka, et al. Nucleotide sequence homologies in control regions of prokaryotic genomes. Gene, 1987.
[3] R. Staden. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Research, 1984.
[4] T. D. Schneider, et al. Information content of binding sites on nucleotide sequences. Journal of Molecular Biology, 1986.
[5] R. Harr, et al. Search algorithm for pattern match analysis of nucleic acid sequences. Nucleic Acids Research, 1983.
[6] J. F. Collins, et al. Applications of parallel processing algorithms for DNA sequence analysis. Nucleic Acids Research, 1984.
[7] P. V. von Hippel, et al. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. Journal of Molecular Biology, 1987.
[8] D. K. Hawley, et al. Compilation and analysis of Escherichia coli promoter DNA sequences. Nucleic Acids Research, 1983.
[9] Robert Entriken, et al. Escherichia coli promoter sequences predict in vitro RNA polymerase selectivity. Nucleic Acids Research, 1984.
[10] Martin E. Mulligan, et al. Analysis of the occurrence of promoter-sites in DNA. Nucleic Acids Research, 1986.