Pair HMM Based Gap Statistics for Re-evaluation of Indels in Alignments with Affine Gap Penalties

Although computational sequence alignment is a crucial step in the vast majority of comparative genomics studies, our understanding of alignment biases still needs to be improved. To infer truly structural or homologous regions, computational alignments need further evaluation. It has been shown that the accuracy of aligned positions can drop substantially, in particular around gaps. Here we focus on the re-evaluation of score-based alignments with affine gap penalties. We exploit their relationship with pair hidden Markov models and develop efficient algorithms to identify gaps that are significant in terms of length and multiplicity. We evaluate our statistics against the well-established structural alignments from SABmark and find that the reliability of indels increases substantially with their significance, in particular in worst-case twilight zone alignments. This indicates that our statistics can reliably complement other methods, which mostly focus on the reliability of match positions.
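To illustrate why a pair hidden Markov model gives a natural notion of gap-length significance, the following minimal sketch assumes the standard textbook pair HMM, in which a gap state is re-entered with a fixed extension probability, so gap lengths are geometrically distributed. The function name `gap_length_tail` and the parameter `epsilon` are hypothetical and chosen for illustration only; the sketch covers only the tail probability of a single gap length and is not the statistics developed in the paper, which also account for multiplicity.

```python
def gap_length_tail(length: int, epsilon: float) -> float:
    """Return P(gap length >= length) under a geometric gap-length model.

    Assumption (not taken from the paper): once a gap is opened it is
    extended with probability `epsilon` at each step, so
    P(length = k) = (1 - epsilon) * epsilon ** (k - 1) for k >= 1,
    and therefore P(length >= k) = epsilon ** (k - 1).
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    if not 0.0 <= epsilon < 1.0:
        raise ValueError("epsilon must lie in [0, 1)")
    return epsilon ** (length - 1)


# Example: with an extension probability of 0.5, a gap of length 5 or more
# has tail probability 0.5 ** 4 = 0.0625 under this simple model.
print(gap_length_tail(5, 0.5))
```

Under this simplified model, longer gaps become exponentially less likely, which is the intuition behind calling a long gap "significant"; a full treatment would also have to correct for the number of gaps observed in an alignment.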
