Protein contact prediction by integrating joint evolutionary coupling analysis and supervised learning

MOTIVATION Protein contact prediction is important for protein structure and functional study. Both evolutionary coupling (EC) analysis and supervised machine learning methods have been developed, making use of different information sources. However, contact prediction is still challenging especially for proteins without a large number of sequence homologs. RESULTS This article presents a group graphical lasso (GGL) method for contact prediction that integrates joint multi-family EC analysis and supervised learning to improve accuracy on proteins without many sequence homologs. Different from existing single-family EC analysis that uses residue coevolution information in only the target protein family, our joint EC analysis uses residue coevolution in both the target family and its related families, which may have divergent sequences but similar folds. To implement this, we model a set of related protein families using Gaussian graphical models and then coestimate their parameters by maximum-likelihood, subject to the constraint that these parameters shall be similar to some degree. Our GGL method can also integrate supervised learning methods to further improve accuracy. Experiments show that our method outperforms existing methods on proteins without thousands of sequence homologs, and that our method performs better on both conserved and family-specific contacts. AVAILABILITY AND IMPLEMENTATION See http://raptorx.uchicago.edu/ContactMap/ for a web server implementing the method. CONTACT j3xu@ttic.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.

[1]  M. Hestenes Multiplier and gradient methods , 1969 .

[2]  Anna Tramontano,et al.  Evaluation of residue–residue contact prediction in CASP10 , 2014, Proteins.

[3]  Yang Zhang,et al.  Scoring function for automated assessment of protein structure template quality , 2004, Proteins.

[4]  Yen Hock Tan,et al.  Statistical potential‐based amino acid similarity matrices for aligning distantly related protein sequences , 2006, Proteins.

[5]  Timothy Nugent,et al.  Accurate de novo structure prediction of large transmembrane protein domains using fragment-assembly and correlated mutation analysis , 2012, Proceedings of the National Academy of Sciences.

[6]  SödingJohannes Protein homology detection by HMM--HMM comparison , 2005 .

[7]  Marcin J. Skwark,et al.  Improved Contact Predictions Using the Recognition of Protein Like Contact Patterns , 2014, PLoS Comput. Biol..

[8]  Massimiliano Pontil,et al.  PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments , 2012, Bioinform..

[9]  A. Biegert,et al.  HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment , 2011, Nature Methods.

[10]  Thomas A. Hopf,et al.  Protein 3D Structure Computed from Evolutionary Sequence Variation , 2011, PloS one.

[11]  Serafim Batzoglou,et al.  CONTRAlign: Discriminative Training for Protein Sequence Alignment , 2006, RECOMB.

[12]  Zhiyong Wang,et al.  Predicting protein contact map using evolutionary and physical constraints by integer programming , 2013, Bioinform..

[13]  Pierre Baldi,et al.  Deep architectures for protein contact map prediction , 2012, Bioinform..

[14]  J. Skolnick,et al.  TM-align: a protein structure alignment algorithm based on the TM-score , 2005, Nucleic acids research.

[15]  E. Aurell,et al.  Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. , 2012, Physical review. E, Statistical, nonlinear, and soft matter physics.

[16]  Chris Bailey-Kellogg,et al.  Protein Design by Sampling an Undirected Graphical Model of Residue Constraints , 2009, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[17]  Jianlin Cheng,et al.  Predicting protein residue-residue contacts using deep networks and boosting , 2012, Bioinform..

[18]  Jinbo Xu,et al.  A position-specific distance-dependent statistical potential for protein structure and functional study. , 2012, Structure.

[19]  Kevin Karplus,et al.  Contact prediction using mutual information and neural nets , 2007, Proteins.

[20]  T. Hwa,et al.  Identification of direct residue contacts in protein–protein interaction by message passing , 2009, Proceedings of the National Academy of Sciences.

[21]  Gregory B. Gloor,et al.  Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction , 2008, Bioinform..

[22]  Stephen P. Boyd,et al.  Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers , 2011, Found. Trends Mach. Learn..

[23]  A. Valencia,et al.  Emerging methods in protein co-evolution , 2013, Nature Reviews Genetics.

[24]  G. Stormo,et al.  Correlated mutations in models of protein sequences: phylogenetic and structural effects , 1999 .

[25]  David T. Jones,et al.  MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins , 2014, Bioinform..

[26]  Marcin J. Skwark,et al.  PconsC: combination of direct information methods and alignments improves contact prediction , 2013, Bioinform..

[27]  Erik van Nimwegen,et al.  Disentangling Direct from Indirect Co-Evolution of Residues in Protein Alignments , 2010, PLoS Comput. Biol..

[28]  Iain M. Wallace,et al.  M-Coffee: combining multiple sequence alignment methods with T-Coffee , 2006, Nucleic acids research.

[29]  Chris Bailey-Kellogg,et al.  Graphical Models of Residue Coupling in Protein Families , 2005, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[30]  Liisa Holm,et al.  BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm358 Sequence analysis , 2022 .

[31]  Thomas A. Hopf,et al.  Three-Dimensional Structures of Membrane Proteins from Genomic Sequencing , 2012, Cell.

[32]  Sheng Wang,et al.  Protein Homology Detection Through Alignment of Markov Random Fields , 2015, SpringerBriefs in Computer Science.

[33]  C. Floudas,et al.  ASTRO-FOLD: a combinatorial and global optimization framework for Ab initio prediction of three-dimensional structures of proteins from the amino acid sequence. , 2003, Biophysical journal.

[34]  F. Morcos,et al.  Genomics-aided structure prediction , 2012, Proceedings of the National Academy of Sciences.

[35]  Sivaraman Balakrishnan,et al.  Learning generative models for protein fold families , 2011, Proteins.

[36]  David E. Kim,et al.  One contact for every twelve residues allows robust and accurate topology‐level protein structure modeling , 2014, Proteins.

[37]  Piero Fariselli,et al.  Is There an Optimal Substitution Matrix for Contact Prediction with Correlated Mutations? , 2011, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[38]  Chris Bailey-Kellogg,et al.  Graphical models of protein–protein interaction specificity from correlated mutations and interaction data , 2009, Proteins.

[39]  Stephen P. Boyd,et al.  An ADMM Algorithm for a Class of Total Variation Regularized Estimation Problems , 2012, 1203.1828.

[40]  Simona Cocco,et al.  From Principal Component to Direct Coupling Analysis of Coevolution in Proteins: Low-Eigenvalue Modes are Needed for Structure Prediction , 2012, PLoS Comput. Biol..

[41]  Patrick Danaher,et al.  The joint graphical lasso for inverse covariance estimation across multiple classes , 2011, Journal of the Royal Statistical Society. Series B, Statistical methodology.

[42]  Jianlin Cheng,et al.  NNcon: improved protein contact map prediction using 2D-recursive neural networks , 2009, Nucleic Acids Res..

[43]  E. Birney,et al.  Pfam: the protein families database , 2013, Nucleic Acids Res..

[44]  D. Baker,et al.  Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era , 2013, Proceedings of the National Academy of Sciences.

[45]  Christopher Jarzynski,et al.  Using Sequence Alignments to Predict Protein Structure and Stability With High Accuracy , 2012, 1207.2484.

[46]  Jinbo Xu,et al.  A multiple‐template approach to protein threading , 2011, Proteins.

[47]  Pierre Baldi,et al.  Improved residue contact prediction using support vector machines and a large feature set , 2007, BMC Bioinformatics.

[48]  L. Holm,et al.  The Pfam protein families database , 2005, Nucleic Acids Res..

[49]  Zhiyong Wang,et al.  MRFalign: Protein Homology Detection through Alignment of Markov Random Fields , 2014, PLoS Comput. Biol..

[50]  Yang Zhang,et al.  A comprehensive assessment of sequence-based and template-based methods for protein contact prediction , 2008, Bioinform..