Context similarity scoring improves protein sequence alignments in the midnight zone

MOTIVATION High-quality protein sequence alignments are essential for a number of downstream applications such as template-based protein structure prediction. In addition to the similarity score between sequence profile columns, many current profile-profile alignment tools use extra terms that compare 1D-structural properties such as secondary structure and solvent accessibility, which are predicted from short profile windows around each sequence position. Such scores add non-redundant information by evaluating the conservation of local patterns of hydrophobicity and other amino acid properties and thus exploiting correlations between profile columns. RESULTS Here, instead of predicting and comparing known 1D properties, we follow an agnostic approach. We learn in an unsupervised fashion a set of maximally conserved patterns represented by 13-residue sequence profiles, without the need to know the cause of the conservation of these patterns. We use a maximum likelihood approach to train a set of 32 such profiles that can best represent patterns conserved within pairs of remotely homologs, structurally aligned training profiles. We include the new context score into our Hmm-Hmm alignment tool hhsearch and improve especially the quality of difficult alignments significantly. CONCLUSION The context similarity score improves the quality of homology models and other methods that depend on accurate pairwise alignments.

[1]  Jian Peng,et al.  Boosting Protein Threading Accuracy , 2009, RECOMB.

[2]  K. Karplus,et al.  Hidden Markov models that use predicted local structure for fold recognition: Alphabets of backbone geometry , 2003, Proteins.

[3]  Mindaugas Margelevicius,et al.  Detection of distant evolutionary relationships between protein families using theory of sequence profile-profile comparison , 2010, BMC Bioinformatics.

[4]  Tim J. P. Hubbard,et al.  SCOP: a Structural Classification of Proteins database , 1999, Nucleic Acids Res..

[5]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[6]  J. Skolnick,et al.  TM-align: a protein structure alignment algorithm based on the TM-score , 2005, Nucleic acids research.

[7]  A. Biegert,et al.  Sequence context-specific profiles for homology searching , 2009, Proceedings of the National Academy of Sciences.

[8]  Jonathan Casper,et al.  Combining local‐structure, fold‐recognition, and new fold methods for protein structure prediction , 2003, Proteins.

[9]  Arne Elofsson,et al.  A study on protein sequence alignment quality , 2002, Proteins.

[10]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[11]  Kevin Karplus,et al.  PREDICT-2ND: a tool for generalized protein local structure prediction , 2008, Bioinform..

[12]  Feng Zhao,et al.  Protein threading using context-specific alignment potential , 2013, Bioinform..

[13]  Johannes Söding,et al.  Protein homology detection by HMM?CHMM comparison , 2005, Bioinform..

[14]  Sitao Wu,et al.  MUSTER: Improving protein sequence profile–profile alignments by using multiple sources of structure information , 2008, Proteins.

[15]  Yang Zhang,et al.  A comparative assessment and analysis of 20 representative sequence alignment methods for protein structure prediction , 2013, Scientific Reports.

[16]  Yaoqi Zhou,et al.  Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates , 2011, Bioinform..

[17]  Burkhard Rost,et al.  Improving fold recognition without folds. , 2004, Journal of molecular biology.

[18]  Kevin Karplus,et al.  Evaluation of local structure alphabets based on residue burial , 2004, Proteins.

[19]  Lukasz A. Kurgan,et al.  SPINE X: Improving protein secondary structure prediction by multistep learning coupled with prediction of solvent accessible surface area and backbone torsion angles , 2012, J. Comput. Chem..

[20]  Johannes Söding,et al.  Discriminative modelling of context-specific amino acid substitution probabilities , 2012, Bioinform..

[21]  W. Kabsch,et al.  Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features , 1983, Biopolymers.

[22]  Song Liu,et al.  Fold recognition by concurrent use of solvent accessibility and residue depth , 2007, Proteins.

[23]  Arne Elofsson,et al.  Improved alignment quality by combining evolutionary information, predicted secondary structure and self-organizing maps , 2006, BMC Bioinform..

[24]  A. Biegert,et al.  HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment , 2011, Nature Methods.

[25]  Markus Porto,et al.  High quality protein sequence alignment by combining structural profile prediction and profile alignment using SABERTOOTH , 2010, BMC Bioinformatics.

[26]  Y Xu,et al.  Protein threading using PROSPECT: Design and evaluation , 2000, Proteins.

[27]  Jianzhu Ma,et al.  A conditional neural fields model for protein threading , 2013 .

[28]  T. Blundell,et al.  Comparative protein modelling by satisfaction of spatial restraints. , 1993, Journal of molecular biology.