Identification of consensus patterns in unaligned DNA and protein sequences: a large-deviation stati

Using log-likelihood statistics to compare sequence alignments, we have been able to determine alignments from multiple unaligned, functionally related DNA sequences (Stormo and Hartzell, 1989, Proc. Natl. Acad. Sci. USA 86, 1183–1187; Hertz et al., 1990, Comput. Appl. Biosci. 6, 81–92) and protein sequences. In this paper, we reanalyze DNA sequences that bind the E. coli repressor LexA to demonstrate that our scoring scheme can identify patterns when each sequence may contain zero or more binding sites. The scoring formula we used previously does not allow for insertions and deletions in the alignments, so we use large-deviation statistics to extend it to allow for them. The insertion-deletion penalty of the resulting scoring scheme depends exclusively on the observed alignment rather than on previous observations or the user's intuition. We also describe the close relationship between our formulas and hidden Markov models. Finally, we present the results of applying the new scoring formula to align a set of E. coli promoter DNA sequences.
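
The abstract does not state the scoring formula itself. As a rough illustration only, the sketch below assumes the log-likelihood statistic takes the familiar relative-entropy (information content) form for a gapless alignment, in which each column contributes the sum of f * ln(f / p) over observed letters, where f is the letter's frequency in that column and p is its background frequency. The function names, the `UNIFORM_DNA` background, and the toy sequences are hypothetical, and the sketch does not include the insertion-deletion extension described above.

```python
import math
from collections import Counter

# Assumed background model: equal frequencies for the four DNA letters.
UNIFORM_DNA = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def information_content(alignment, background):
    """Relative entropy of a gapless alignment, summed over columns (in nats).

    alignment  -- list of equal-length strings, one aligned site per sequence
    background -- dict mapping each letter to its a priori frequency
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sites must have the same length")

    n = len(alignment)
    total = 0.0
    for col in range(width):
        counts = Counter(row[col] for row in alignment)
        for letter, count in counts.items():
            f = count / n  # observed frequency; unobserved letters contribute 0
            total += f * math.log(f / background[letter])
    return total


def log_likelihood_ratio(alignment, background):
    """Information content scaled by the number of aligned sequences."""
    return len(alignment) * information_content(alignment, background)


if __name__ == "__main__":
    conserved = ["ACGT", "ACGT"]      # every column perfectly conserved
    mixed = ["AC", "CA", "GT", "TG"]  # every column matches the background
    print(information_content(conserved, UNIFORM_DNA))
    print(information_content(mixed, UNIFORM_DNA))
```

Under these assumptions, a perfectly conserved column scores ln 4 against a uniform DNA background, and a column whose letter frequencies match the background scores zero, so higher totals indicate alignments that deviate more strongly from chance.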

[1] R. A. Leibler, et al. On Information and Sufficiency, 1951.

[2] M. O. Dayhoff, et al. A Model of Evolutionary Change in Proteins, 1978.

[3] M. S. Waterman, et al. Identification of common molecular subsequences, 1981, Journal of molecular biology.

[4] J. Richardson, et al. Simultaneous comparison of three protein sequences, 1985, Proceedings of the National Academy of Sciences of the United States of America.

[5] T. D. Schneider, et al. Information content of binding sites on nucleotide sequences, 1986, Journal of molecular biology.

[6] D. Bacon, et al. Multiple Sequence Alignment, 1986, Journal of molecular biology.

[7] M. Waterman, et al. A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons, 1987, Journal of molecular biology.

[8] M. Sternberg, et al. A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons, 1987, Journal of molecular biology.

[9] P. V. von Hippel, et al. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters, 1987, Journal of molecular biology.

[10] C. Harley, et al. Analysis of E. coli promoter sequences, 1987, Nucleic acids research.

[11] Timothy R. C. Read, et al. Goodness-Of-Fit Statistics for Discrete Multivariate Data, 1988.

[12] G. Stormo, et al. Identifying protein-binding sites from unaligned DNA fragments, 1989, Proceedings of the National Academy of Sciences of the United States of America.

[13] Rodger Staden, et al. Methods for calculating the probabilities of finding patterns in sequences, 1989, Comput. Appl. Biosci.

[14] Gary D. Stormo, et al. Identification of consensus patterns in unaligned DNA sequences known to be functionally related, 1990, Comput. Appl. Biosci.

[15] G. Stormo. Consensus patterns in DNA, 1990, Methods in enzymology.

[16] A. A. Reilly, et al. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences, 1990, Proteins.

[17] G. Stormo, et al. Specificity of the Mnt protein determined by binding to randomized operators, 1991, Proceedings of the National Academy of Sciences of the United States of America.

[18] J. Collado-Vides, et al. Control site location and transcriptional regulation in Escherichia coli, 1991, Microbiological reviews.

[19] Thomas M. Cover, et al. Elements of Information Theory, 2005.

[20] G. Parmigiani. Large Deviation Techniques in Decision, Simulation and Estimation, 1992.

[21] M. S. Waterman, et al. Sequence alignment and penalty choice. Review of concepts, case studies and implications, 1994, Journal of molecular biology.

[22] D. Haussler, et al. Hidden Markov models in computational biology. Applications to protein modeling, 1993, Journal of molecular biology.

[23] M. A. McClure, et al. Hidden Markov models of biological primary sequence information, 1994, Proceedings of the National Academy of Sciences of the United States of America.