Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments

Background: The pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins (proteins that share a common ancestor and a similar structure), but they often fail to accurately represent the structural alignments inferred from three-dimensional coordinates. Because sequence alignment algorithms return optimal-scoring alignments, the best structural alignments must correspond to suboptimal sequence alignment scores. We therefore examined a range of suboptimal sequence alignments and a range of scoring parameters to better understand which sequence alignments are likely to be more structurally accurate.

Results: We compared near-optimal protein sequence alignments produced by the Zuker algorithm, and a set of probabilistic alignments produced by the probA program, with structural alignments produced by four different structure alignment algorithms. For sequences that share significant sequence similarity (E-values < 10^-5), the solution space of structural alignments overlaps significantly with both the near-optimal sequence alignments produced by commonly used scoring parameters and the ensemble of probA alignments. We constructed a logistic regression model incorporating three input variables derived from sets of near-optimal alignments: robustness, edge frequency, and maximum bits-per-position. A ROC analysis shows that this model classifies amino acid pairs (edges in the alignment path graph) according to their likelihood of appearing in structural alignments more accurately than the robustness score alone. We investigated various trimming protocols for removing incorrect edges from the optimal sequence alignment. The most effective protocol removes matches from the semi-global optimal alignment that lie outside the boundaries of the local alignment, although trimming according to the model-generated probabilities achieves a similar level of improvement. The model can also be used to generate novel alignments by using the probabilities in lieu of a scoring matrix. These alignments are typically better than the optimal sequence alignment and include novel correct structural edges. The probA alignments sample a larger variety of alignments than the Zuker set, and so more frequently include alignments that are closer to the structural alignments, but using the probA alignments as input to the regression model does not increase performance.

Conclusions: The pool of suboptimal pairwise protein sequence alignments substantially overlaps structure-based alignments for pairs with statistically significant similarity, and a regression model based on information contained in this alignment pool improves the accuracy of pairwise alignments with respect to structure-based alignments.
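The edge-classification and trimming steps described in the Results can be illustrated with a minimal Python sketch. It assumes a standard logistic form over the three named variables; the coefficient values, the function names `edge_probability` and `trim_alignment`, the 0.5 threshold, and the example feature values are all hypothetical placeholders, not values or code from the study.

```python
import math

# Hypothetical coefficients for illustration only; the fitted values
# from the study are not given in this abstract.
INTERCEPT = -4.0
W_ROBUSTNESS = 0.5
W_EDGE_FREQUENCY = 3.0
W_MAX_BITS = 1.0


def edge_probability(robustness, edge_frequency, max_bits_per_position):
    """Logistic estimate that an aligned residue pair (edge) is structurally correct."""
    z = (INTERCEPT
         + W_ROBUSTNESS * robustness
         + W_EDGE_FREQUENCY * edge_frequency
         + W_MAX_BITS * max_bits_per_position)
    return 1.0 / (1.0 + math.exp(-z))


def trim_alignment(edges, features, threshold=0.5):
    """Keep only the edges whose modelled probability meets the threshold.

    edges    -- list of (i, j) residue-index pairs from an alignment
    features -- dict mapping each edge to a tuple
                (robustness, edge_frequency, max_bits_per_position)
    """
    return [e for e in edges if edge_probability(*features[e]) >= threshold]


# Made-up feature values for two edges of a toy alignment.
edges = [(1, 1), (2, 3)]
features = {
    (1, 1): (5.0, 0.9, 2.0),   # well-supported edge
    (2, 3): (0.0, 0.1, 0.5),   # weakly supported edge
}
trimmed = trim_alignment(edges, features)
```

With these placeholder numbers the weakly supported edge falls below the threshold and is dropped. In practice the coefficients would be fitted against structural alignments, and the threshold chosen from the ROC curve.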
