Statistical Significance of Normalized Global Alignment

The comparison of homologous proteins from different species is a first step toward function assignment and the reconstruction of species evolution. Although local alignment is most often used for this purpose, global alignment is important for constructing multiple alignments and phylogenetic trees. However, the statistical significance of global alignments is not well established: existing approaches either lack a specific statistical model of the alignments or depend on computationally expensive methods such as the Z-score. We recently presented a normalized global alignment, defined as the best compromise between global alignment cost and length, and showed that it yields better classification results than the Z-score at a much lower computational cost. For the normalized global alignment to be considered a fully functional protein alignment algorithm, however, its statistical significance must also be analyzed. Experiments with unrelated proteins extracted from the SCOP ASTRAL database show that normalized global alignment scores can be fitted to a log-normal distribution. This empirical finding, obtained without theoretical support, can be used to derive the statistical significance of normalized global alignments. The results are summarized in a table of fitted parameters for different scoring schemes.
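As an illustration of how a fitted log-normal distribution can be turned into a significance estimate, the following is a minimal sketch, not the authors' implementation. It assumes that scores are strictly positive and that larger scores are more extreme (if the normalized score is a cost where smaller is better, the lower tail would be used instead). The function names `fit_lognormal` and `p_value`, and the synthetic background sample standing in for scores of unrelated protein pairs, are hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def fit_lognormal(scores):
    """Estimate log-normal parameters as the mean and sample standard
    deviation of the log-scores. Scores must be strictly positive."""
    logs = [math.log(s) for s in scores]
    return fmean(logs), stdev(logs)


def p_value(score, mu, sigma):
    """Upper-tail probability P(X >= score) under the fitted log-normal.
    Assumption: larger scores are more extreme."""
    return 1.0 - NormalDist(mu, sigma).cdf(math.log(score))


# Synthetic background scores (an assumption for illustration only; real
# use would take scores from alignments of unrelated proteins).
rng = random.Random(0)
background = [rng.lognormvariate(0.5, 0.25) for _ in range(5000)]

mu, sigma = fit_lognormal(background)
observed = 3.0  # hypothetical score of a new alignment
print(f"mu={mu:.3f} sigma={sigma:.3f} p={p_value(observed, mu, sigma):.3g}")
```

Because the logarithm of a log-normal variable is normally distributed, the fit reduces to estimating a mean and a standard deviation on the log scale, and the p-value follows from the normal cumulative distribution function.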
