Score statistics of global sequence alignment from the energy distribution of a modified directed polymer and directed percolation problem.

Sequence alignment is one of the most important bioinformatics tools in modern molecular biology, and the statistical characterization of gapped alignment scores is a long-standing problem in sequence alignment research. Using a variant of the directed path in random media model, we investigate the score statistics of global sequence alignment, taking into account in particular the compositional bias of the sequences compared. Such statistics are used to distinguish accidental similarity due to compositional similarity from biologically significant similarity. To accommodate the compositional bias, we introduce an extra parameter $p$, the probability that a positive matching score occurs. When $p$ is small, a high-scoring alignment cannot come from compositional similarity. When $p$ is large, the highest-scoring point within a global alignment tends to lie close to the end of both sequences, in which case we say the system percolates. By applying finite-size scaling theory to the percolation probability functions for various system sizes (sequence lengths), we obtain the critical value $p_c$ at infinite size. For an alignment of length $t$, we confirm that the score fluctuation grows as $\chi t^{1/3}$ by investigating the scaling form of the alignment score. Using the Kolmogorov-Smirnov test, we show that the random variable $\chi$, if properly scaled, follows the Tracy-Widom distributions: the Gaussian orthogonal ensemble for $p$ slightly larger than $p_c$ and the Gaussian unitary ensemble for larger $p$. Although these results deepen our understanding of the distribution of alignment scores, their use in practical applications remains somewhat heuristic and needs further development. Nevertheless, we illustrate the possibility of characterizing score statistics for modest system sizes (sequence lengths) via a proper reparametrization of the alignment scores.
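The percolation criterion described above can be illustrated with a minimal sketch. It is not the model used in the paper: the scoring scheme (match score +1 with probability p, otherwise -1), the linear gap cost, the `margin` that defines "close to the end", and all function names are assumptions made for illustration only.

```python
import random


def score_grid(n, p, rng, match=1.0, mismatch=-1.0):
    """Random n x n scores: `match` with probability p, else `mismatch` (assumed scheme)."""
    return [[match if rng.random() < p else mismatch for _ in range(n)]
            for _ in range(n)]


def global_alignment_peak(scores, gap=1.0):
    """Needleman-Wunsch style recursion with a linear gap cost.

    Returns the highest score found anywhere in the dynamic-programming
    lattice and the lattice point (i, j) where it first occurs.
    """
    n = len(scores)
    prev = [-gap * j for j in range(n + 1)]  # row 0: leading gaps only
    best, best_pos = 0.0, (0, 0)             # the origin has score 0
    for i in range(1, n + 1):
        cur = [-gap * i] + [0.0] * n         # column 0: leading gaps only
        for j in range(1, n + 1):
            cur[j] = max(prev[j - 1] + scores[i - 1][j - 1],  # align i with j
                         prev[j] - gap,                        # gap in one sequence
                         cur[j - 1] - gap)                     # gap in the other
            if cur[j] > best:
                best, best_pos = cur[j], (i, j)
        prev = cur
    return best, best_pos


def percolation_probability(n, p, trials=200, margin=0.1, seed=0):
    """Fraction of random trials whose peak lies within `margin` of the end corner."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        _, (i, j) = global_alignment_peak(score_grid(n, p, rng))
        if min(i, j) >= (1.0 - margin) * n:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Scan p at a few sizes; curves of this kind are the input to a
    # finite-size scaling analysis.
    for n in (20, 40):
        row = [percolation_probability(n, p, trials=50) for p in (0.2, 0.5, 0.8)]
        print(n, row)
```

In this toy version the two limits behave as the text suggests: at p = 0 every step lowers the score, so the peak stays at the origin, while at p = 1 the peak sits at the end of both sequences.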
