Fast pseudolikelihood maximization for direct-coupling analysis of protein structure from many homologous amino-acid sequences

Abstract Direct-coupling analysis is a group of methods for harvesting information about coevolving residues in a protein family by learning a generative model in an exponential family from data. In protein families of realistic size, this learning can only be done approximately, and there is a trade-off between inference precision and computational speed. Here we show that a previously introduced l2-regularized pseudolikelihood maximization method, plmDCA, can be modified so as to be easily parallelizable, as well as inherently faster on a single processor, at negligible difference in accuracy. We test the new incarnation of the method on 143 protein family/structure pairs from the Protein Families database (PFAM), one of the larger tests of this class of algorithms to date.
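To make the idea of a parallelizable, l2-regularized pseudolikelihood concrete, here is a minimal sketch in plain Python. It rests on assumptions the abstract does not spell out: a Potts-type model with per-site fields and pairwise couplings, and a per-site objective in which each site r has its own parameter set, so that the sites can be optimized independently. The function name `site_objective`, the parameter layout `h_r` / `J_r`, and the regularization strengths `lam_h` / `lam_J` are hypothetical, and sequence reweighting is omitted for brevity. It is an illustration, not the plmDCA implementation.

```python
import math


def site_objective(r, h_r, J_r, seqs, lam_h=0.01, lam_J=0.01):
    """l2-regularized negative log-pseudolikelihood for a single site r.

    Hypothetical parameter layout (an assumption of this sketch):
      h_r[a]       field for state a at site r
      J_r[i][a][b] coupling between state a at site r and state b at site i
                   (the entry i == r is ignored)
      seqs         list of aligned sequences, each a list of ints in range(q)
    """
    q = len(h_r)
    n_sites = len(seqs[0])
    total = 0.0
    for s in seqs:
        # Unnormalized log-probability of each candidate state a at site r,
        # conditioned on the observed states at all other sites.
        logits = [
            h_r[a] + sum(J_r[i][a][s[i]] for i in range(n_sites) if i != r)
            for a in range(q)
        ]
        # Numerically stable log of the conditional normalization constant.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Subtract the log conditional probability of the observed state.
        total -= logits[s[r]] - log_z
    # l2 penalty on the fields and on the couplings to every other site.
    penalty = lam_h * sum(x * x for x in h_r)
    penalty += lam_J * sum(
        J_r[i][a][b] ** 2
        for i in range(n_sites) if i != r
        for a in range(q)
        for b in range(q)
    )
    return total / len(seqs) + penalty


# Toy alignment: 3 sequences, 4 sites, 3 states (illustrative values only).
q, n_sites = 3, 4
seqs = [[0, 1, 2, 0], [1, 1, 0, 2], [2, 0, 0, 1]]
h0 = [0.0] * q
J0 = [[[0.0] * q for _ in range(q)] for _ in range(n_sites)]

# Each site's objective touches only its own parameters, so this loop
# could be distributed across processors, one site per worker.
values = [site_objective(r, h0, J0, seqs) for r in range(n_sites)]
```

With all parameters at zero, every conditional distribution is uniform, so each entry of `values` equals log(q). In a real fit, each per-site objective would be handed to a numerical optimizer, and the resulting coupling estimates for each pair of sites would then have to be combined into a single score.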
