PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments

MOTIVATION The accurate prediction of residue-residue contacts, critical for maintaining the native fold of a protein, remains an open problem in the field of structural bioinformatics. Interest in this long-standing problem has increased recently with algorithmic improvements and the rapid growth in the sizes of sequence families. Progress could have major impacts in both structure and function prediction to name but two benefits. Sequence-based contact predictions are usually made by identifying correlated mutations within multiple sequence alignments (MSAs), most commonly through the information-theoretic approach of calculating mutual information between pairs of sites in proteins. These predictions are often inaccurate because the true covariation signal in the MSA is often masked by biases from many ancillary indirect-coupling or phylogenetic effects. Here we present a novel method, PSICOV, which introduces the use of sparse inverse covariance estimation to the problem of protein contact prediction. Our method builds on work which had previously demonstrated corrections for phylogenetic and entropic correlation noise and allows accurate discrimination of direct from indirectly coupled mutation correlations in the MSA. RESULTS PSICOV displays a mean precision substantially better than the best performing normalized mutual information approach and Bayesian networks. For 118 out of 150 targets, the L/5 (i.e. top-L/5 predictions for a protein of length L) precision for long-range contacts (sequence separation >23) was ≥ 0.5, which represents an improvement sufficient to be of significant benefit in protein structure prediction or model quality assessment. AVAILABILITY The PSICOV source code can be downloaded from http://bioinf.cs.ucl.ac.uk/downloads/PSICOV.

[1]  P Fariselli,et al.  Prediction of contact maps with neural networks and correlated mutations. , 2001, Protein engineering.

[2]  A. Mclachlan Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c 551 . , 1971, Journal of molecular biology.

[3]  N. Meinshausen,et al.  High-dimensional graphs and variable selection with the Lasso , 2006, math/0608017.

[4]  Haim Ashkenazy,et al.  Reducing phylogenetic bias in correlated mutation analysis. , 2010, Protein engineering, design & selection : PEDS.

[5]  秀俊 松井,et al.  Statistics for High-Dimensional Data: Methods, Theory and Applications , 2014 .

[6]  Burkhard Rost,et al.  PROFcon: novel prediction of long-range contacts , 2005, Bioinform..

[7]  S. Henikoff,et al.  Position-based sequence weights. , 1994, Journal of molecular biology.

[8]  T. Hwa,et al.  Identification of direct residue contacts in protein–protein interaction by message passing , 2009, Proceedings of the National Academy of Sciences.

[9]  Zheng Yuan,et al.  Better prediction of protein contact number using a support vector regression analysis of amino acid sequence , 2005, BMC Bioinformatics.

[10]  C. Sander,et al.  Correlated mutations and residue contacts in proteins , 1994, Proteins.

[11]  Alexandre d'Aspremont,et al.  Model Selection Through Sparse Max Likelihood Estimation Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data , 2022 .

[12]  Graziano Pesole,et al.  Correlated substitution analysis and the prediction of amino acid structural contacts , 2007, Briefings Bioinform..

[13]  L. Holm,et al.  The Pfam protein families database , 2005, Nucleic Acids Res..

[14]  Erik van Nimwegen,et al.  Disentangling Direct from Indirect Co-Evolution of Residues in Protein Alignments , 2010, PLoS Comput. Biol..

[15]  Michele Magrane,et al.  UniProt Knowledgebase: a hub of integrated protein data , 2011, Database J. Biol. Databases Curation.

[16]  Thomas Huber,et al.  An introduction to protein contact prediction. , 2008, Methods in molecular biology.

[17]  Robert D. Finn,et al.  The Pfam protein families database , 2004, Nucleic Acids Res..

[18]  Jian Huang,et al.  BMC Bioinformatics BioMed Central Methodology article Supervised group Lasso with applications to microarray data , 2007 .

[19]  Bin Xue,et al.  Predicting residue–residue contact maps by a two‐layer, integrated neural‐network method , 2009, Proteins.

[20]  Cristina Marino Buslje,et al.  Correction for phylogeny, small number of observations and data redundancy improves the identification of coevolving amino acid pairs using mutual information , 2009, Bioinform..

[21]  David S. Eisenberg,et al.  Using inferred residue contacts to distinguish between correct and incorrect protein models , 2008, Bioinform..

[22]  L. C. Martin,et al.  Using information theory to search for co-evolving residues in proteins , 2005, Bioinform..

[23]  E. Neher How frequent are correlated changes in families of protein sequences? , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[24]  K. Burrage,et al.  Protein contact prediction using patterns of correlation , 2004, Proteins.

[25]  Burkhard Rost,et al.  EVAcon: a protein contact prediction evaluation service , 2005, Nucleic Acids Res..

[26]  Sara van de Geer,et al.  Statistics for High-Dimensional Data , 2011 .

[27]  Alexandre d'Aspremont,et al.  Model Selection Through Sparse Maximum Likelihood Estimation , 2007, ArXiv.

[28]  M. Yuan,et al.  Model selection and estimation in the Gaussian graphical model , 2007 .

[29]  Robert M. MacCallum,et al.  Striped sheets and protein contact prediction , 2004, ISMB/ECCB.

[30]  G. Stormo,et al.  Correlated mutations in protein sequences: Phylogenetic and structural effects , 1997 .

[31]  Gregory B. Gloor,et al.  Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction , 2008, Bioinform..

[32]  G. Stormo,et al.  Correlated mutations in models of protein sequences: phylogenetic and structural effects , 1999 .

[33]  Pierre Baldi,et al.  Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners , 2002, ISMB.

[34]  Jens Meiler,et al.  CASP6 assessment of contact prediction , 2005, Proteins.

[35]  Najeeb M. Halabi,et al.  Protein Sectors: Evolutionary Units of Three-Dimensional Structure , 2009, Cell.

[36]  C. Sander,et al.  Correlated Mutations and Residue Contacts , 1994 .

[37]  W. Taylor,et al.  Effectiveness of correlation analysis in identifying protein residues undergoing correlated evolution. , 1997, Protein engineering.

[38]  Olivier Ledoit,et al.  Improved estimation of the covariance matrix of stock returns with an application to portfolio selection , 2003 .

[39]  R. Tibshirani,et al.  Sparse inverse covariance estimation with the graphical lasso. , 2008, Biostatistics.

[40]  Osvaldo Graña,et al.  Assessment of domain boundary predictions and the prediction of intramolecular contacts in CASP8 , 2009, Proteins.

[41]  Christopher Bystroff,et al.  Predicting interresidue contacts using templates and pathways , 2003, Proteins.

[42]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[43]  M Michael Gromiha,et al.  Inter-residue interactions in protein folding and stability. , 2004, Progress in biophysics and molecular biology.

[44]  Roland L. Dunbrack,et al.  CAFASP3: The third critical assessment of fully automated structure prediction methods , 2003, Proteins.

[45]  Roland L. Dunbrack,et al.  CAFASP2: The second critical assessment of fully automated structure prediction methods , 2001, Proteins.

[46]  Piero Fariselli,et al.  Is There an Optimal Substitution Matrix for Contact Prediction with Correlated Mutations? , 2011, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[47]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[48]  A. Valencia,et al.  Improving contact predictions by the combination of correlated mutations and other sources of sequence information. , 1997, Folding & design.