Correlated mutations via regularized multinomial regression

Background: In addition to sequence conservation, protein multiple sequence alignments contain evolutionary signal in the form of correlated variation among amino acid positions. This signal indicates positions in the sequence that influence each other, and can be applied for the prediction of intra- or intermolecular contacts. Although various approaches exist for the detection of such correlated mutations, in general these methods utilize only pairwise correlations. Hence, they tend to conflate direct and indirect dependencies.

Results: We propose RMRCM, a method for Regularized Multinomial Regression in order to obtain Correlated Mutations from protein multiple sequence alignments. Importantly, our method is not restricted to pairwise (column-column) comparisons only, but takes into account the network nature of relationships between protein residues in order to predict residue-residue contacts. The use of regularization ensures that the number of predicted links between columns in the multiple sequence alignment remains limited, preventing overprediction. Using simulated datasets we analyzed the performance of our approach in predicting residue-residue contacts, and studied how it is influenced by various types of noise. For various biological datasets, validation with protein structure data indicates a good performance of the proposed algorithm for the prediction of residue-residue contacts, in comparison to previous results. RMRCM can also be applied to predict interactions (in addition to only predicting interaction sites or contact sites), as demonstrated by predicting PDZ-peptide interactions.

Conclusions: A novel method is presented, which uses regularized multinomial regression in order to obtain correlated mutations from protein multiple sequence alignments.

Availability: R-code of our implementation is available via http://www.ab.wur.nl/rmrcm
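The contrast between pairwise correlation and a regression over all columns can be illustrated with a small sketch. The code below is not the authors' R implementation. It is a minimal pure-Python illustration, assuming a toy three-letter alphabet, a simulated three-column alignment in which column A influences B and B influences C, and an L1 (lasso) penalty fitted by plain proximal gradient descent. The function names, the penalty strength `lam`, and the step size are all hypothetical choices made for this example.

```python
import math
import random

ALPHABET = "AVL"  # toy alphabet (assumption); real alignments use 20 residues plus gap


def one_hot(msa, cols):
    """Encode the chosen alignment columns as concatenated indicator vectors."""
    k = len(ALPHABET)
    rows = []
    for seq in msa:
        x = [0.0] * (k * len(cols))
        for j, c in enumerate(cols):
            x[j * k + ALPHABET.index(seq[c])] = 1.0
        rows.append(x)
    return rows


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w towards zero by t."""
    return math.copysign(max(abs(w) - t, 0.0), w)


def fit_l1_multinomial(X, y, n_classes, lam, lr=0.5, n_iter=200):
    """Softmax regression with an L1 penalty on the weights (bias unpenalized)."""
    n, p = len(X), len(X[0])
    W = [[0.0] * p for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(n_iter):
        gW = [[0.0] * p for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, t in zip(X, y):
            logits = [b[c] + sum(w * v for w, v in zip(W[c], x))
                      for c in range(n_classes)]
            probs = softmax(logits)
            for c in range(n_classes):
                err = (probs[c] - (1.0 if c == t else 0.0)) / n
                gb[c] += err
                for j, v in enumerate(x):
                    if v:
                        gW[c][j] += err * v
        for c in range(n_classes):
            b[c] -= lr * gb[c]
            for j in range(p):
                W[c][j] = soft_threshold(W[c][j] - lr * gW[c][j], lr * lam)
    return W


def column_scores(msa, target, lam=0.05):
    """Regress one column on all others; sum |weights| per predictor column."""
    k = len(ALPHABET)
    cols = [c for c in range(len(msa[0])) if c != target]
    X = one_hot(msa, cols)
    y = [ALPHABET.index(seq[target]) for seq in msa]
    W = fit_l1_multinomial(X, y, k, lam)
    return {c: sum(abs(W[cls][j * k + a]) for cls in range(k) for a in range(k))
            for j, c in enumerate(cols)}


def toy_chain_msa(n=200, keep=0.9, seed=0):
    """Simulate columns A -> B -> C: C depends on A only through B."""
    rng = random.Random(seed)
    msa = []
    for _ in range(n):
        a = rng.choice(ALPHABET)
        b = a if rng.random() < keep else rng.choice(ALPHABET)
        c = b if rng.random() < keep else rng.choice(ALPHABET)
        msa.append(a + b + c)
    return msa


msa = toy_chain_msa()
scores = column_scores(msa, target=2)  # predict column C (index 2) from A and B
print(scores)  # keys are predictor column indices: 0 = A, 1 = B
```

In this simulation columns A and C are correlated when compared pairwise, yet the regression of C on both A and B should assign most of the weight to B, the only direct neighbour, while the penalty shrinks the weights for A. Repeating the regression with each column in turn as the target would give a score for every column pair, which is one way to read the "network" view described above.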
