Significance of Z-value Statistics of Smith-Waterman Scores for Protein Alignments

The Z-value is an attempt to estimate the statistical significance of a Smith-Waterman dynamic alignment score (SW-score) by means of a Monte-Carlo process. It partly reduces the bias induced by the composition and length of the sequences. This paper is not a theoretical study of the distribution of SW-scores and Z-values. Rather, it presents a statistical analysis of Z-values on large datasets of protein sequences, leading to a probability law that the experimental Z-values follow. First, we determine the relationships between the computed Z-value, an estimate of its variance, and the number of randomizations in the Monte-Carlo process. Second, we illustrate that Z-values are less correlated with sequence lengths than SW-scores are. Third, we show that pairwise alignments performed on 'quasi-real' sequences (i.e., randomly shuffled sequences of the same length and amino acid composition as the real ones) lead to Z-value distributions that statistically fit an extreme value distribution (EVD), more precisely the Gumbel distribution, referred to here as the global EVD. For real protein sequences, however, we observe an over-representation of high Z-values. We first determine a cutoff value that separates these over-represented Z-values from those that follow the global EVD. We then show that the interesting part of the tail of the Z-value distribution can be approximated by another EVD (i.e., one that differs from the global EVD) or by a Pareto law. This has been confirmed for all proteins analysed so far, whether extracted from individual genomes or from the ensemble of five complete microbial genomes comprising altogether 16956 protein sequences.
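
As an illustration of the Monte-Carlo process described above, the following is a minimal sketch of a Z-value computation, not the implementation used in the paper. It assumes the common definition Z = (S - mean) / sd, where S is the SW-score of the real pair and the mean and standard deviation are taken over SW-scores obtained after shuffling one of the two sequences. The function names (`sw_score`, `z_value`), the match/mismatch scores, and the linear gap penalty are hypothetical simplifications chosen for brevity; a realistic setup would use a substitution matrix and affine gap penalties.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring parameters are illustrative assumptions, not the
    paper's scoring scheme.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def z_value(a, b, n_shuffles=100, seed=0):
    """Monte-Carlo Z-value of the SW-score of a against b.

    b is shuffled n_shuffles times, which preserves its length and
    amino acid composition; a is left untouched.
    """
    rng = random.Random(seed)
    observed = sw_score(a, b)
    letters = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(sw_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        # Degenerate case: every shuffle gave the same score.
        return float("inf") if observed > mean else 0.0
    return (observed - mean) / sd
```

Because the estimate depends on a finite number of shuffles, repeated runs with different seeds give slightly different Z-values; this is the variance that the abstract relates to the number of randomizations.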
