Aligning Protein Sequences with Predicted Secondary Structure

Accurately aligning distant protein sequences is notoriously difficult. Because the amino acid sequence alone often does not carry enough information to yield accurate alignments under the standard alignment scoring functions, a recent approach to improving alignment accuracy is to use additional information such as secondary structure. We make several advances in the alignment of protein sequences annotated with predicted secondary structure: (1) more accurate models for scoring alignments, (2) efficient algorithms for optimal alignment under these models, (3) improved learning criteria for setting model parameters through inverse alignment, and (4) in-depth experiments evaluating model variants on benchmark alignments. More specifically, the new models use secondary structure predictions and their confidences to modify the scoring of both substitutions and gaps. All models have efficient algorithms for optimal pairwise alignment that run in near-quadratic time. The models have many parameters, which are learned using inverse alignment under a new criterion that balances score error and recovery error. We evaluate the models by measuring how accurately an optimal alignment under each model recovers benchmark reference alignments that are based on the known three-dimensional structures of the proteins. The experiments show that the new models provide a significant boost in accuracy over the standard model for distant sequences. The improvement for pairwise alignment is as much as 15% for sequences with less than 25% identity, while for multiple alignment the improvement is more than 20% for difficult benchmarks whose accuracy under standard tools is at most 40%.
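
To make the idea of structure-modified scoring concrete, the following is a minimal sketch and not the models of this work. It assumes a toy match/mismatch substitution score, a three-state prediction (helix H, strand E, coil C) given as per-residue confidence dictionaries, a substitution bonus proportional to the chance that both residues share a state, and a linear gap cost that grows outside predicted coil. All function names and parameter values are hypothetical, and the recurrence is plain quadratic-time global alignment rather than the near-quadratic algorithms described above.

```python
from typing import Dict, List

SS_STATES = "HEC"  # helix, strand, coil (assumed three-state prediction)
Conf = Dict[str, float]  # confidence per state for one residue, summing to 1


def sub_score(a: str, b: str, pa: Conf, pb: Conf,
              match: float = 2.0, mismatch: float = -1.0,
              bonus: float = 2.0) -> float:
    """Toy substitution score plus a bonus for agreeing predicted structure."""
    base = match if a == b else mismatch
    # Probability that both residues are in the same state, treating the
    # two predictions as independent (an illustrative assumption).
    agree = sum(pa[c] * pb[c] for c in SS_STATES)
    return base + bonus * agree


def gap_cost(p: Conf, base: float = 2.0, extra: float = 2.0) -> float:
    """Cost of aligning a residue to a gap; higher outside predicted coil."""
    return base + extra * (1.0 - p["C"])


def align_score(s: str, t: str, ps: List[Conf], pt: List[Conf]) -> float:
    """Optimal global alignment score by dynamic programming, O(len(s)*len(t))."""
    n, m = len(s), len(t)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] - gap_cost(ps[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] - gap_cost(pt[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(
                D[i - 1][j - 1] + sub_score(s[i - 1], t[j - 1], ps[i - 1], pt[j - 1]),
                D[i - 1][j] - gap_cost(ps[i - 1]),  # s[i-1] against a gap
                D[i][j - 1] - gap_cost(pt[j - 1]),  # t[j-1] against a gap
            )
    return D[n][m]


# Hypothetical confident predictions for single residues.
HELIX = {"H": 1.0, "E": 0.0, "C": 0.0}
STRAND = {"H": 0.0, "E": 1.0, "C": 0.0}
COIL = {"H": 0.0, "E": 0.0, "C": 1.0}
```

In this sketch, a mismatched pair of residues that are both confidently predicted as helix scores higher than the same pair with conflicting predictions, which is the qualitative effect the structure-aware substitution term is meant to capture. In the work itself the corresponding parameters are learned by inverse alignment rather than set by hand.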
