Using multi-data hidden Markov models trained on local neighborhoods of protein structure to predict residue-residue contacts

MOTIVATION Correct prediction of residue-residue contacts in proteins that lack good templates with known structure would take ab initio protein structure prediction a large step forward. The lack of correct contacts, and in particular long-range contacts, is considered the main reason why these methods often fail. RESULTS We propose a novel hidden Markov model (HMM)-based method for predicting residue-residue contacts from protein sequences using as training data homologous sequences, predicted secondary structure and a library of local neighborhoods (local descriptors of protein structure). The library consists of recurring structural entities incorporating short-, medium- and long-range interactions and is general enough to reassemble the cores of nearly all proteins in the PDB. The method is tested on an external test set of 606 domains with no significant sequence similarity to the training set as well as 151 domains with SCOP folds not present in the training set. Considering the top 0.2 x L predictions (L = sequence length), our HMMs obtained an accuracy of 22.8% for long-range interactions in new fold targets, and an average accuracy of 28.6% for long-, medium- and short-range contacts. This is a significant performance increase over currently available methods when comparing against results published in the literature. AVAILABILITY http://predictioncenter.org/Services/FragHMMent/.

[1]  Yang Zhang,et al.  A comprehensive assessment of sequence-based and template-based methods for protein contact prediction , 2008, Bioinform..

[2]  B. Rost Twilight zone of protein sequence alignments. , 1999, Protein engineering.

[3]  Pierre Baldi,et al.  Improved residue contact prediction using support vector machines and a large feature set , 2007, BMC Bioinformatics.

[4]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[5]  S. Henikoff,et al.  Amino acid substitution matrices. , 2000, Advances in protein chemistry.

[6]  Tatiana Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[7]  S. Pietrokovski,et al.  A pair‐to‐pair amino acids substitution matrix and its applications for protein structure prediction , 2007, Proteins.

[8]  J. Skolnick,et al.  TOUCHSTONE II: a new approach to ab initio protein structure prediction. , 2003, Biophysical journal.

[9]  Thomas L. Madden,et al.  Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. , 1997, Nucleic acids research.

[10]  J. Skolnick,et al.  Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm , 2004, Proteins.

[11]  Kevin Karplus,et al.  Contact prediction using mutual information and neural nets , 2007, Proteins.

[12]  Alfonso Valencia,et al.  Assessment of intramolecular contact predictions for CASP7 , 2007, Proteins.

[13]  Andrew J. Viterbi,et al.  Error bounds for convolutional codes and an asymptotically optimum decoding algorithm , 1967, IEEE Trans. Inf. Theory.

[14]  Alessandro Vullo,et al.  A two-stage approach for improved prediction of residue contact maps , 2006, BMC Bioinformatics.

[15]  Anna Tramontano Of men and machines , 2003, Nature Structural Biology.

[16]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[17]  H. Wolfson,et al.  Correlated mutations: Advances and limitations. A study on fusion proteins and on the Cohesin‐Dockerin families , 2006, Proteins.

[18]  C. Sander,et al.  Can three-dimensional contacts in protein structures be predicted by analysis of correlated mutations? , 1994, Protein engineering.

[19]  Yang Zhang Progress and challenges in protein structure prediction. , 2008, Current opinion in structural biology.

[20]  Joaquín Dopazo,et al.  PupasView: a visual tool for selecting suitable SNPs, with putative pathological effect in genes, for genotyping purposes , 2005, Nucleic Acids Res..

[21]  Prasanna R Kolatkar,et al.  Assessment of CASP7 structure predictions for template free targets , 2007, Proteins.

[22]  Patrice Koehl,et al.  The ASTRAL compendium for protein structure and sequence analysis , 2000, Nucleic Acids Res..

[23]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[24]  A. Valencia,et al.  Improving contact predictions by the combination of correlated mutations and other sources of sequence information. , 1997, Folding & design.

[25]  Jorja G. Henikoff,et al.  Using substitution probabilities to improve position-specific scoring matrices , 1996, Comput. Appl. Biosci..

[26]  Tim J. P. Hubbard,et al.  SCOP database in 2002: refinements accommodate structural genomics , 2002, Nucleic Acids Res..

[27]  David E. Kim,et al.  Physically realistic homology models built with ROSETTA can be more accurate than their templates. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[28]  S F Altschul,et al.  Iterated profile searches with PSI-BLAST--a tool for discovery in protein databases. , 1998, Trends in biochemical sciences.

[29]  K. Burrage,et al.  Protein contact prediction using patterns of correlation , 2004, Proteins.

[30]  Krzysztof Fidelis,et al.  Local descriptors of protein structure: A systematic analysis of the sequence‐structure relationship in proteins using short‐ and long‐range interactions , 2009, Proteins.

[31]  Gavin C. Cawley,et al.  Efficient leave-one-out cross-validation of kernel fisher discriminant classifiers , 2003, Pattern Recognit..

[32]  Christodoulos A. Floudas,et al.  Advances in protein structure prediction and de novo protein design : A review , 2006 .

[33]  Tim J. P. Hubbard,et al.  SCOP database in 2004: refinements integrate structure and sequence family data , 2004, Nucleic Acids Res..

[34]  Liam J. McGuffin,et al.  The PSIPRED protein structure prediction server , 2000, Bioinform..

[35]  Emil Alexov,et al.  Predicting residue contacts using pragmatic correlated mutations method: reducing the false positives , 2006, BMC Bioinformatics.

[36]  Janusz M Bujnicki,et al.  Protein‐Structure Prediction by Recombination of Fragments , 2006, Chembiochem : a European journal of chemical biology.