Large-Deviation Properties of Sequence Alignment of Correlated Sequences

The significance of the alignment score of optimally aligned DNA sequences can be estimated from the score distribution of pairs of random sequences. This requires statistics for the relevant high-scoring tail of the distribution. For local alignments of sequences drawn iid, it has already been shown that the often-assumed Gumbel distribution does not hold in the tail of the distribution and has to be corrected by a Gaussian factor. Real DNA sequences have been observed to show long-range correlations within sequences, which iid random sequences do not model correctly. In this publication, the large-deviation method used in previous studies is applied to local and global alignments of sequences with such long-range correlations. We study the distributions over the full range of the support and obtain probabilities as low as [Formula: see text]. We show that a correction to the Gumbel distribution is again necessary, and we study the dependence of the parameters on the correlation strength. For global alignments, the Gamma distribution, which earlier simple-sampling studies found heuristically to be a good fit, turns out to be a poor fit.
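
To make the baseline concrete, the following is a minimal sketch of simple sampling for local alignment scores of iid random DNA sequences. It is an illustration under stated assumptions, not the method of this publication: the scoring parameters (`match`, `mismatch`, `gap`), the linear gap cost, and the function names are hypothetical choices, and the sequences are uncorrelated. Simple sampling of this kind only resolves probabilities down to roughly one over the number of sampled pairs, which is why the rare-event tail calls for a large-deviation approach.

```python
import random


def local_alignment_score(a, b, match=1, mismatch=-3, gap=-5):
    """Optimal local alignment score (Smith-Waterman recursion).

    Assumed scoring scheme: linear gap cost, illustrative parameter values.
    Only two rows of the dynamic-programming matrix are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def sample_scores(n_pairs, length, seed=0):
    """Scores of n_pairs pairs of iid uniform random DNA sequences."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_pairs):
        a = rng.choices("ACGT", k=length)
        b = rng.choices("ACGT", k=length)
        scores.append(local_alignment_score(a, b))
    return scores


def histogram(scores):
    """Normalized histogram: estimated probability of each observed score."""
    counts = {}
    for s in scores:
        counts[s] = counts.get(s, 0) + 1
    n = len(scores)
    return {s: c / n for s, c in sorted(counts.items())}


if __name__ == "__main__":
    for score, prob in histogram(sample_scores(1000, 40)).items():
        print(score, prob)
```

With 1000 sampled pairs, the smallest nonzero probability this histogram can report is 0.001, so the high-scoring tail relevant for significance estimates stays out of reach without a biased sampling scheme.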
