Estimating Pairwise Statistical Significance of Protein Local Alignments Using a Clustering-Classification Approach Based on Amino Acid Composition

A central question in pairwise sequence comparison is assessing the statistical significance of the alignment. The alignment score distribution is known to follow an extreme value distribution with analytically calculable parameters K and λ for ungapped alignments with one substitution matrix. However, no statistical theory is currently available for the gapped case or for alignments using multiple scoring matrices, although their score distributions are known to closely follow an extreme value distribution, and the corresponding parameters can be estimated by simulation. Ideal estimation would require simulation for each sequence pair, which is impractical. In this paper, we present a simple clustering-classification approach based on amino acid composition to estimate K and λ for a given sequence pair and scoring scheme, including schemes that use multiple parameter sets. The resulting values of K and λ for different cluster pairs vary widely even for the same scoring scheme, underscoring the heavy dependence of K and λ on amino acid composition. The proposed approach is an attempt to separate out the influence of amino acid composition in estimating the statistical significance of pairwise protein alignments. Experiments and analysis of other approaches to estimating statistical parameters also indicate that the methods used in this work estimate the statistical significance with good accuracy.
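The idea of looking up K and λ by composition cluster and then evaluating the extreme value tail can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid classifier, the function names, the toy centroids and the K and λ values in the usage note are all hypothetical, and the P-value uses the standard form P(S ≥ x) ≈ 1 − exp(−K·m·n·e^(−λx)) with raw sequence lengths m and n and no edge-effect correction.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the 20-dimensional amino acid frequency vector of a sequence."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def nearest_cluster(vec, centroids):
    """Index of the centroid closest to vec (squared Euclidean distance).

    Assumption: clusters are represented by composition centroids; the
    clustering method itself is not specified here.
    """
    def dist2(c):
        return sum((x - y) ** 2 for x, y in zip(vec, c))
    return min(range(len(centroids)), key=lambda i: dist2(centroids[i]))


def evd_pvalue(score, m, n, K, lam):
    """P(S >= score) = 1 - exp(-K * m * n * exp(-lam * score))."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for very small x.
    return -math.expm1(-K * m * n * math.exp(-lam * score))


def pairwise_pvalue(seq1, seq2, score, centroids, params):
    """Classify both sequences, look up (K, lambda) for the cluster pair,
    and evaluate the extreme value tail probability.

    params is a hypothetical table mapping (cluster_i, cluster_j) to
    (K, lambda), e.g. precomputed by simulation for each cluster pair.
    """
    i = nearest_cluster(composition(seq1), centroids)
    j = nearest_cluster(composition(seq2), centroids)
    K, lam = params[(i, j)]
    return evd_pvalue(score, len(seq1), len(seq2), K, lam)
```

As a usage illustration with made-up numbers, two toy centroids could be built as `centroids = [composition("AAAA"), composition("WWWW")]`, with `params` holding one invented `(K, λ)` pair per ordered cluster pair; `pairwise_pvalue(seq1, seq2, score, centroids, params)` then returns the estimated P-value for an alignment score. In practice the table would be filled in once by simulation per cluster pair, which is what avoids a separate simulation for every sequence pair.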
