A conditional neural fields model for protein threading

Motivation: Alignment errors are still the main bottleneck for current template-based protein modeling (TM) methods, including protein threading and homology modeling, especially when the sequence identity between the two proteins under consideration is low (<30%).

Results: We present a novel protein threading method, CNFpred, which achieves much more accurate sequence–template alignment by employing a probabilistic graphical model called a Conditional Neural Field (CNF), which aligns one protein sequence to its remote template using a non-linear scoring function. This scoring function accounts for correlation among a variety of protein sequence and structure features, makes use of information in the neighborhood of the two residues to be aligned, and is thus much more sensitive than the widely used linear or profile-based scoring functions. To train this CNF threading model, we employ a novel quality-sensitive method, instead of the standard maximum-likelihood method, to directly maximize the expected alignment quality on the training set. Experimental results show that CNFpred generates significantly better alignments than the best profile-based and threading methods on several public (but small) benchmarks as well as on our own large dataset. CNFpred outperforms the other methods regardless of the lengths or classes of proteins, and works particularly well for proteins with sparse sequence profiles because it makes effective use of structure information. Our methodology can also be adapted to protein sequence alignment.

Contact: j3xu@ttic.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
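To make the idea of a non-linear, neighborhood-aware scoring function concrete, the following is a minimal illustrative sketch in Python. It is not the CNFpred implementation. It assumes a single hidden layer of sigmoid gates, a fixed linear gap penalty, and a plain global dynamic program in place of the CNF's probabilistic alignment states. All names (`neural_match_score`, `window_features`, `align`) and parameters (`half_window`, `gap`) are hypothetical, and the weights are supplied by the caller rather than trained.

```python
import math


def neural_match_score(features, hidden_weights, output_weights):
    """Non-linear match score: sum_k u_k * sigmoid(w_k . features).

    Assumption: one hidden layer of sigmoid gates, chosen for illustration.
    """
    score = 0.0
    for w, u in zip(hidden_weights, output_weights):
        activation = sum(wi * fi for wi, fi in zip(w, features))
        score += u / (1.0 + math.exp(-activation))
    return score


def window_features(seq_profile, tmpl_profile, i, j, half_window=1):
    """Concatenate per-residue features from a window around positions i and j.

    Positions that fall off either end are zero-padded, so the feature
    vector always has the same length.
    """
    feats = []
    for d in range(-half_window, half_window + 1):
        for profile, pos in ((seq_profile, i + d), (tmpl_profile, j + d)):
            if 0 <= pos < len(profile):
                feats.extend(profile[pos])
            else:
                feats.extend([0.0] * len(profile[0]))
    return feats


def align(seq_profile, tmpl_profile, hidden_weights, output_weights, gap=-0.5):
    """Global alignment by dynamic programming with the neural match score.

    Returns (best score, list of aligned (sequence index, template index) pairs).
    Assumption: a constant per-position gap penalty, not a learned gap model.
    """
    n, m = len(seq_profile), len(tmpl_profile)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0], move[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        best[0][j], move[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            feats = window_features(seq_profile, tmpl_profile, i - 1, j - 1)
            match = neural_match_score(feats, hidden_weights, output_weights)
            candidates = [
                (best[i - 1][j - 1] + match, "diag"),  # align residue i with j
                (best[i - 1][j] + gap, "up"),          # gap in the template
                (best[i][j - 1] + gap, "left"),        # gap in the sequence
            ]
            best[i][j], move[i][j] = max(candidates)
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if move[i][j] == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs
```

Each profile here is a list of equal-length per-residue feature vectors, and each hidden weight vector must match the length returned by `window_features`. A linear profile-based score corresponds to dropping the sigmoid gates; the gated form lets the score depend on combinations of features across the window rather than on each feature independently.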
