Statistical significance of ungapped sequence alignments.

The statistical significance of a local sequence alignment depends not only on the similarity score and the lengths of the sequences, but also on the length of the alignment itself. The dependence on sequence length has been analyzed previously; it rests on the idea that longer sequences have more chances to share a high-scoring local similarity. To the best of our knowledge, the dependence of statistical significance on alignment length has not been used for selecting the best alignments. We have applied formulas for assessing the statistical significance of ungapped local alignments to real proteins. Let $L$ be the length of the alignment; then the expected similarity score is $S_{\mathrm{exp}} = \bar{m} L$, where $\bar{m}$ is the expected similarity between two randomly chosen residues. The value of $\bar{m}$ can be calculated from a similarity (substitution) matrix $M$ and the amino acid frequencies $P$: $\bar{m} = \sum_{i,j} p_i p_j m_{ij}$. The probability of observing a score $S$ greater than or equal to $x$ for an alignment of length $L$ is given by the normal distribution: $\Pr(S \ge x) = 1 - \Phi\left((x - S_{\mathrm{exp}})/\sigma\right) = 1 - \Phi\left((x - \bar{m} L)/(\sigma_m \sqrt{L})\right)$, where $\Phi$ is the cumulative normal distribution and $\sigma_m$ is the standard deviation of $m$. From these formulas we conclude that the best alignment should be selected using a normalized similarity score: $S' = \max\left\{(S - \bar{m} L)/(\sigma_m \sqrt{L})\right\}$. The proposed normalization has been tested on a representative benchmark. To evaluate its performance, we calculated several measures of recognition quality, and the normalization improved all of them. This procedure is important for choosing the correct alignment for homology modelling as well as for detecting distantly related sequences in databases.
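
The normalization described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the two-letter toy alphabet, and the candidate scores are all hypothetical. The standard deviation of $m$ is assumed here to be taken over random residue pairs drawn with the same frequencies $P$, which the abstract does not state explicitly.

```python
import math


def residue_pair_moments(freqs, matrix):
    """Mean and standard deviation of the substitution score m for two
    residues drawn independently with frequencies `freqs`.
    Assumption: sigma_m is the standard deviation under these same frequencies."""
    mean = sum(freqs[a] * freqs[b] * matrix[a][b] for a in freqs for b in freqs)
    second = sum(freqs[a] * freqs[b] * matrix[a][b] ** 2 for a in freqs for b in freqs)
    return mean, math.sqrt(second - mean ** 2)


def normalized_score(score, length, mean, sd):
    """(S - mean * L) / (sd * sqrt(L)) for an ungapped alignment of length L."""
    return (score - mean * length) / (sd * math.sqrt(length))


def tail_probability(score, length, mean, sd):
    """Normal approximation to Prob(S >= score) for an alignment of length L."""
    z = normalized_score(score, length, mean, sd)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def best_alignment(candidates, mean, sd):
    """Pick the (score, length) pair with the largest normalized score."""
    return max(candidates, key=lambda c: normalized_score(c[0], c[1], mean, sd))


# Hypothetical toy alphabet: match = +1, mismatch = -1, equal frequencies.
freqs = {"A": 0.5, "B": 0.5}
matrix = {"A": {"A": 1, "B": -1}, "B": {"A": -1, "B": 1}}
mean, sd = residue_pair_moments(freqs, matrix)

# Two hypothetical candidate alignments given as (raw score, length).
candidates = [(12, 16), (15, 36)]
chosen = best_alignment(candidates, mean, sd)
```

In this toy case the longer alignment has the higher raw score (15 against 12), but the shorter one has the higher normalized score (3.0 against 2.5), so the normalized criterion selects it.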
