Sequence Comparison Significance and Poisson Approximation

The Chen-Stein method of Poisson approximation has been used to establish theorems about the comparison of two DNA or protein sequences. The most useful of these results for sequence alignment applies to alignment scoring without gaps. Until now, however, there has been no valid method for assigning statistical significance to alignment scores with gaps. In this paper we use the Aldous clumping heuristic to extend Poisson approximation techniques into a practical method for estimating the statistical significance of gapped alignment scores.
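To make the idea of a Poisson-based significance estimate concrete, here is a minimal sketch. It assumes that the number of high-scoring alignment "clumps" between sequences of lengths m and n is roughly Poisson with mean gamma * m * n * p**t, so that the chance of a best score of at least t is about 1 - exp(-gamma * m * n * p**t). The abstract does not state this form; the function name `poisson_pvalue` and the parameters `gamma` and `p` are illustrative assumptions, and in practice they would have to be estimated for a given scoring scheme.

```python
import math


def poisson_pvalue(score, m, n, gamma, p):
    """Approximate P(S >= score) under an assumed Poisson clumping model.

    Assumption (not taken from the abstract): the expected number of
    clumps scoring at least `score` is lam = gamma * m * n * p**score,
    with gamma > 0 and 0 < p < 1 fitted for the scoring scheme in use.
    """
    if m <= 0 or n <= 0:
        raise ValueError("sequence lengths must be positive")
    if gamma <= 0.0:
        raise ValueError("gamma must be positive")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")

    lam = gamma * m * n * p ** score  # expected number of clumps
    # P(at least one clump) = 1 - exp(-lam); expm1 keeps precision
    # when lam is very small.
    return -math.expm1(-lam)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    for t in (30, 40, 50):
        print(t, poisson_pvalue(t, m=300, n=300, gamma=0.1, p=0.8))
```

When the expected clump count is small, the estimate is close to that count itself, which is why the tail probability falls off geometrically in the score under this assumed model.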
