Where Does the Alignment Score Distribution Shape Come from?

Alignment algorithms are powerful tools for searching databases for homologous proteins; they assign a score to each sequence in the database. It has been well known for 20 years that the shape of the score distribution resembles an extreme value distribution. Because biologists encounter this class of distributions so often, the question arises of the evolutionary origin of this probability law. We investigated whether the main properties of sequence alignment score distributions can be derived from a basic evolutionary process: duplication-divergence protein evolution in a sequence space. First, the distribution of sequences in this space was defined with respect to the genetic distance between sequences. Second, we derived a basic relation between genetic distance and alignment score. We obtained a novel score probability distribution that is qualitatively very similar to that of Karlin-Altschul but performs better than all previous models.
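For orientation, the Karlin-Altschul reference distribution mentioned above can be sketched as follows. This is a minimal illustration of the standard type I extreme value (Gumbel) tail, P(S >= x) = 1 - exp(-K m n e^(-lambda x)), and not the novel distribution derived in the paper. The function name and the parameter values in the usage lines are assumptions chosen for illustration; in practice lambda and K depend on the scoring matrix and gap penalties.

```python
import math


def karlin_altschul_pvalue(score, m, n, lam, K):
    """Tail probability P(S >= score) under the Karlin-Altschul model.

    m, n : lengths of the two compared sequences (or query and database)
    lam  : scale parameter lambda (assumed known for the scoring system)
    K    : pre-factor K (assumed known for the scoring system)
    """
    # Expected number of chance alignments scoring at least `score` (E-value).
    e_value = K * m * n * math.exp(-lam * score)
    # 1 - exp(-E), computed with expm1 to stay accurate when E is tiny.
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    for s in (30, 50, 70):
        print(s, karlin_altschul_pvalue(s, m=300, n=300, lam=0.3, K=0.1))
```

For high scores the E-value is small and the p-value is approximately equal to it, which is why the two quantities are often used interchangeably in database searches.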
