Linear predictive coding representation of correlated mutation for protein sequence alignment

Although conservation and correlated mutation (CM) both carry important information, reflecting different kinds of context in a multiple sequence alignment, most alignment methods use sequence profiles that represent only conservation. There is not yet a general way to represent correlated mutation and incorporate it into sequence alignment. We develop a novel method, the CM profile, which represents correlated mutation as a spectral feature derived by linear predictive coding, so that correlated mutations among different positions are represented by a fixed number of values. We combine the CM profile with a conventional sequence profile to improve alignment quality. For distantly related protein pairs, using the CM profile improves profile-profile alignment with or without predicted secondary structure. In particular, at the superfamily level, combining the CM profile with the sequence profile improves profile-profile alignment by 9.5%, whereas predicted secondary structure improves it by 6.0%. More significantly, using both improves profile-profile alignment by 13.9%. We also illustrate the effectiveness of the CM profile by demonstrating that the resulting alignments preserve shared coevolution and contacts. Because the CM profile is general, it can be used in other bioinformatics applications in the same way as a sequence profile.
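The central idea, that linear predictive coding summarizes a variable-length series of correlation values with a fixed number of coefficients, can be illustrated with a minimal sketch. It uses the standard autocorrelation method with the Levinson-Durbin recursion; it is not the authors' implementation. The input rows, the function names, and the order of 4 are hypothetical, and the abstract does not specify the correlated-mutation measure, the LPC order, or any further spectral transformation applied to the coefficients.

```python
def autocorrelation(signal, max_lag):
    """Return autocorrelation values r[0..max_lag] of a real-valued sequence."""
    n = len(signal)
    return [
        sum(signal[t] * signal[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc_coefficients(signal, order):
    """Estimate LPC coefficients with the Levinson-Durbin recursion.

    Returns (coeffs, err): coeffs[k-1] is the weight a_k in the predictor
    signal[t] ~ sum_k a_k * signal[t - k], and err is the residual energy.
    """
    r = autocorrelation(signal, order)
    if r[0] == 0.0:
        # An all-zero signal carries no information to model.
        return [0.0] * order, 0.0
    a = [0.0] * (order + 1)  # a[0] is unused; a[1..order] are the weights
    err = r[0]
    for i in range(1, order + 1):
        if err <= 1e-12 * r[0]:
            break  # signal is already (numerically) perfectly predicted
        # Reflection coefficient for model order i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


# Hypothetical correlated-mutation scores of one alignment position against
# all other positions, for two proteins of different lengths.
row_short = [0.1, 0.7, 0.3, 0.9, 0.2, 0.4, 0.8, 0.05, 0.6, 0.35, 0.5, 0.15]
row_long = row_short + [0.45, 0.25, 0.65, 0.3, 0.55, 0.1, 0.75, 0.2]

ORDER = 4  # assumed number of coefficients kept per position
feature_short, _ = lpc_coefficients(row_short, ORDER)
feature_long, _ = lpc_coefficients(row_long, ORDER)
```

Both `feature_short` and `feature_long` have `ORDER` entries even though the input rows differ in length. This fixed size is the property that would allow such a feature to sit alongside a fixed-width sequence profile column.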
