A machine learning information retrieval approach to protein fold recognition.

MOTIVATION Recognizing proteins that have similar tertiary structure is the key step of template-based protein structure prediction methods. Traditionally, a variety of alignment methods are used to identify similar folds, based on sequence similarity and sequence-structure compatibility. Although these methods are complementary, their integration has not been thoroughly exploited. Statistical machine learning methods provide tools for integrating multiple features, but so far these methods have been used primarily for protein and fold classification, rather than addressing the retrieval problem of fold recognition-finding a proper template for a given query protein. RESULTS Here we present a two-stage machine learning, information retrieval, approach to fold recognition. First, we use alignment methods to derive pairwise similarity features for query-template protein pairs. We also use global profile-profile alignments in combination with predicted secondary structure, relative solvent accessibility, contact map and beta-strand pairing to extract pairwise structural compatibility features. Second, we apply support vector machines to these features to predict the structural relevance (i.e. in the same fold or not) of the query-template pairs. For each query, the continuous relevance scores are used to rank the templates. The FOLDpro approach is modular, scalable and effective. Compared with 11 other fold recognition methods, FOLDpro yields the best results in almost all standard categories on a comprehensive benchmark dataset. Using predictions of the top-ranked template, the sensitivity is approximately 85, 56, and 27% at the family, superfamily and fold levels respectively. Using the 5 top-ranked templates, the sensitivity increases to 90, 70, and 48%.

[1]  M. Sternberg,et al.  Enhanced genome annotation using structural profiles in the program 3D-PSSM. , 2000, Journal of molecular biology.

[2]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[3]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[4]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[5]  Leszek Rychlewski,et al.  Fold prediction by a hierarchy of sequence, threading, and modeling methods , 1998, Protein science : a publication of the Protein Society.

[6]  J Skolnick,et al.  Defrosting the frozen approximation: PROSPECTOR— A new approach to threading , 2001, Proteins.

[7]  Marcin von Grotthuss,et al.  ORFeus: detection of distant homology using sequence profiles and predicted secondary structure , 2003, Nucleic Acids Res..

[8]  D. Fischer,et al.  A study of combined structure/sequence profiles. , 1996, Folding & design.

[9]  B. Rost,et al.  Critical assessment of methods of protein structure prediction (CASP)—Round 6 , 2005, Proteins.

[10]  S. B. Needleman,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 1989 .

[11]  Sam Griffiths-Jones,et al.  The use of structure information to increase alignment accuracy does not aid homologue detection with profile HMMs , 2002, Bioinform..

[12]  Pierre Baldi,et al.  Three-stage prediction of protein ?-sheets by neural networks, alignments and graph algorithms , 2005, ISMB.

[13]  Roland L Dunbrack,et al.  Scoring profile‐to‐profile sequence alignments , 2004, Protein science : a publication of the Protein Society.

[14]  S. Bryant,et al.  Critical assessment of methods of protein structure prediction (CASP): Round II , 1997, Proteins.

[15]  D Fischer,et al.  Hybrid fold recognition: combining sequence derived properties with evolutionary information. , 1999, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[16]  T L Blundell,et al.  FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. , 2001, Journal of molecular biology.

[17]  M S Waterman,et al.  Sequence alignment and penalty choice. Review of concepts, case studies and implications. , 1994, Journal of molecular biology.

[18]  B. Honig,et al.  On the role of structural information in remote homology detection and sequence alignment: new methods using hybrid sequence profiles. , 2003, Journal of molecular biology.

[19]  P. Baldi,et al.  Prediction of coordination number and relative solvent accessibility in proteins , 2002, Proteins.

[20]  M. O. Dayhoff,et al.  Establishing homologies in protein sequences. , 1983, Methods in enzymology.

[21]  D. T. Jones,et al.  A new approach to protein fold recognition , 1992, Nature.

[22]  E. Lindahl,et al.  Identification of related proteins on family, superfamily and fold level. , 2000, Journal of molecular biology.

[23]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[24]  Ying Xu,et al.  An Efficient Computational Method for Globally Optimal Threading , 1998, J. Comput. Biol..

[25]  A. Elofsson,et al.  Hidden Markov models that use predicted secondary structures for fold recognition , 1999, Proteins.

[26]  Pierre Baldi,et al.  Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles , 2002, Proteins.

[27]  M. Madera,et al.  A comparison of profile hidden Markov model procedures for remote homology detection. , 2002, Nucleic acids research.

[28]  Mike Hu Methodologies for Constructing and Training Large Hierarchical Hidden Markov Models for Sequence Analysis , 2001 .

[29]  Dong Xu,et al.  PROSPECT II: protein structure prediction program for genome-scale applications. , 2003, Protein engineering.

[30]  Johannes Söding,et al.  Protein homology detection by HMM?CHMM comparison , 2005, Bioinform..

[31]  Alejandro A. Schäffer,et al.  IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices , 1999, Bioinform..

[32]  Eleazar Eskin,et al.  The Spectrum Kernel: A String Kernel for SVM Protein Classification , 2001, Pacific Symposium on Biocomputing.

[33]  R. Abagyan,et al.  Recognition of distantly related proteins through energy calculations , 1994, Proteins.

[34]  A. Atiya,et al.  Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond , 2005, IEEE Transactions on Neural Networks.

[35]  Arne Elofsson,et al.  Using evolutionary information for the query and target improves fold recognition , 2004, Proteins.

[36]  S. Bryant,et al.  An empirical energy function for threading protein sequence through the folding motif , 1993, Proteins.

[37]  R B Russell,et al.  Fold recognition from sequence comparisons , 2001, Proteins.

[38]  Arne Elofsson,et al.  3D-Jury: A Simple Approach to Improve Protein Structure Predictions , 2003, Bioinform..

[39]  Pierre Baldi,et al.  SCRATCH: a protein structure and structural feature prediction server , 2005, Nucleic Acids Res..

[40]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[41]  Richard Hughey,et al.  Hidden Markov models for detecting remote protein homologies , 1998, Bioinform..

[42]  D. Higgins,et al.  T-Coffee: A novel method for fast and accurate multiple sequence alignment. , 2000, Journal of molecular biology.

[43]  Kimmen Sjölander,et al.  SATCHMO: Sequence Alignment and Tree Construction Using Hidden Markov Models , 2003, Bioinform..

[44]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[45]  Hongyi Zhou,et al.  Single‐body residue‐level knowledge‐based energy score combined with sequence‐profile and secondary structure information for fold recognition , 2004, Proteins.

[46]  Michael Gribskov,et al.  Score Distributions for Simultaneous Matching to Multiple Motifs , 1997, J. Comput. Biol..

[47]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[48]  M. A. McClure,et al.  Hidden Markov models of biological primary sequence information. , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[49]  D. Haussler,et al.  Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. , 1998, Journal of molecular biology.

[50]  Cédric Notredame,et al.  3DCoffee: combining protein sequences and structures within multiple sequence alignments. , 2004, Journal of molecular biology.

[51]  A. Sali,et al.  Alignment of protein sequences by their profiles , 2004, Protein science : a publication of the Protein Society.

[52]  S. Henikoff,et al.  Amino acid substitution matrices from protein blocks. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[53]  Kimmen Sjölander,et al.  COACH : profile-profile alignment of protein families using hidden Markov models , 2003 .

[54]  N. Grishin,et al.  COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. , 2003, Journal of molecular biology.

[55]  Daniel Fischer,et al.  3D‐SHOTGUN: A novel, cooperative, fold‐recognition meta‐predictor , 2003, Proteins.

[56]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[57]  A. Godzik,et al.  Comparison of sequence profiles. Strategies for structural predictions using sequence information , 2008, Protein science : a publication of the Protein Society.

[58]  D. Eisenberg,et al.  A method to identify protein sequences that fold into a known three-dimensional structure. , 1991, Science.

[59]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[60]  Ralf Zimmer,et al.  Profile-Profile Alignment: A Powerful Tool for Protein Structure Prediction , 2002, Pacific Symposium on Biocomputing.

[61]  Yang Liu,et al.  A Study of Hidden Markov Model , 2004 .

[63]  Arne Elofsson,et al.  Profile–profile methods provide improved fold‐recognition: A study of different profile–profile alignment methods , 2004, Proteins.

[64]  David C. Jones,et al.  GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. , 1999, Journal of molecular biology.

[65]  B Honig,et al.  Combining multiple structure and sequence alignments to improve sequence detection and alignment: Application to the SH2 domains of Janus kinases , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[66]  I W Hunter,et al.  3D-1D threading methods for protein fold recognition. , 2000, Pharmacogenomics.

[67]  Liam J. McGuffin,et al.  Improving sequence-based fold recognition by using 3D model quality assessment , 2005, Bioinform..

[68]  A. Godzik,et al.  Sequence-structure matching in globular proteins: application to supersecondary and tertiary structure determination. , 1992, Proceedings of the National Academy of Sciences of the United States of America.

[69]  Nick V. Grishin,et al.  Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments , 2003, Bioinform..

[70]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[71]  Pierre Baldi,et al.  Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners , 2002, ISMB.

[72]  Kimmen Sjölander,et al.  Simultaneous Sequence Alignment and Tree Construction Using Hidden Markov Models , 2002, Pacific Symposium on Biocomputing.

[73]  Eugene W. Myers,et al.  Basic local alignment search tool. Journal of Molecular Biology , 1990 .

[74]  A G Murzin,et al.  Distant homology recognition using structural classification of proteins , 1997, Proteins.

[75]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[76]  Hongyi Zhou,et al.  Fold recognition by combining sequence profiles derived from evolution and from depth‐dependent structural alignment of fragments , 2004, Proteins.

[77]  Golan Yona,et al.  Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. , 2002, Journal of molecular biology.

[78]  David Haussler,et al.  A Discriminative Framework for Detecting Remote Protein Homologies , 2000, J. Comput. Biol..

[79]  A. Panchenko,et al.  Combination of threading potentials and sequence profiles improves fold recognition. , 2000, Journal of molecular biology.

[80]  Y Shan,et al.  Fold recognition and accurate query‐template alignment by a combination of PSI‐BLAST and threading , 2001, Proteins.

[81]  B. Rost,et al.  Protein fold recognition by prediction-based threading. , 1997, Journal of molecular biology.

[82]  J Lundström,et al.  Pcons: A neural‐network–based consensus predictor that improves fold recognition , 2001, Protein science : a publication of the Protein Society.

[83]  M J Sippl,et al.  Structure-based evaluation of sequence comparison and fold recognition alignment accuracy. , 2000, Journal of molecular biology.

[84]  S. B. Needleman,et al.  A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[85]  A Valencia,et al.  A neural network approach to evaluate fold recognition results , 2003, Proteins.

[86]  Jinbo Xu Protein Structure Prediction by Linear Programming , 2003 .

[87]  C. Chothia,et al.  Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. , 2001, Journal of molecular biology.

[88]  Anders Krogh,et al.  Hidden Markov models for sequence analysis: extension and analysis of the basic method , 1996, Comput. Appl. Biosci..