Fold recognition by combining profile-profile alignment and support vector machine

MOTIVATION Currently, the most accurate fold-recognition method is to perform profile-profile alignments and estimate the statistical significances of those alignments by calculating Z-score or E-value. Although this scheme is reliable in recognizing relatively close homologs related at the family level, it has difficulty in finding the remote homologs that are related at the superfamily or fold level. RESULTS In this paper, we present an alternative method to estimate the significance of the alignments. The alignment between a query protein and a template of length n in the fold library is transformed into a feature vector of length n + 1, which is then evaluated by support vector machine (SVM). The output from SVM is converted to a posterior probability that a query sequence is related to a template, given SVM output. Results show that a new method shows significantly better performance than PSI-BLAST and profile-profile alignment with Z-score scheme. While PSI-BLAST and Z-score scheme detect 16 and 20% of superfamily-related proteins, respectively, at 90% specificity, a new method detects 46% of these proteins, resulting in more than 2-fold increase in sensitivity. More significantly, at the fold level, a new method can detect 14% of remotely related proteins at 90% specificity, a remarkable result considering the fact that the other methods can detect almost none at the same level of specificity.

[1]  Ying Xu,et al.  Raptor: Optimal Protein Threading by Linear Programming , 2003, J. Bioinform. Comput. Biol..

[2]  Arne Elofsson,et al.  Profile–profile methods provide improved fold‐recognition: A study of different profile–profile alignment methods , 2004, Proteins.

[3]  T L Blundell,et al.  FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. , 2001, Journal of molecular biology.

[4]  John Platt,et al.  Probabilistic Outputs for Support vector Machines and Comparisons to Regularized Likelihood Methods , 1999 .

[5]  Lisa N Kinch,et al.  CASP5 assessment of fold recognition target predictions , 2003, Proteins.

[6]  Michael Gribskov,et al.  Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching , 1996, Comput. Chem..

[7]  Thomas Lengauer,et al.  Arby: automatic protein structure prediction using profile-profile alignment and confidence measures , 2004, Bioinform..

[8]  David Haussler,et al.  A Discriminative Framework for Detecting Remote Protein Homologies , 2000, J. Comput. Biol..

[9]  Arne Elofsson,et al.  A study on protein sequence alignment quality , 2002, Proteins.

[10]  S. Hua,et al.  A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. , 2001, Journal of molecular biology.

[11]  Patrice Koehl,et al.  The ASTRAL Compendium in 2004 , 2003, Nucleic Acids Res..

[12]  A. Godzik,et al.  The interplay of fold recognition and experimental structure determination in structural genomics. , 2004, Current opinion in structural biology.

[13]  Alfonso Valencia,et al.  Predicting reliable regions in protein alignments from sequence profiles. , 2003, Journal of molecular biology.

[14]  Sung-Hou Kim,et al.  A global representation of the protein fold space , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[15]  Tong Liu,et al.  The Status of Structural Genomics Defined Through the Analysis of Current Targets and Structures , 2003, Pacific Symposium on Biocomputing.

[16]  K Karplus,et al.  Predicting protein structure using only sequence information , 1999, Proteins.

[17]  David C. Jones,et al.  GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. , 1999, Journal of molecular biology.

[18]  Liam J. McGuffin,et al.  The Genomic Threading Database: a comprehensive resource for structural annotations of the genomes from key organisms , 2004, Nucleic Acids Res..

[19]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[20]  A. Godzik,et al.  Comparison of sequence profiles. Strategies for structural predictions using sequence information , 2008, Protein science : a publication of the Protein Society.

[21]  M. Gerstein,et al.  Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. , 2001, Journal of molecular biology.

[22]  Golan Yona,et al.  Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. , 2002, Journal of molecular biology.

[23]  Wynne Hsu,et al.  Remote homolog detection using local sequence–structure correlations , 2004, Proteins.

[24]  C. Chothia,et al.  Intermediate sequences increase the detection of homology between sequences. , 1997, Journal of molecular biology.

[25]  Steven E. Brenner,et al.  Target selection for structural genomics , 2000, Nature Structural Biology.

[26]  Burkhard Rost,et al.  Improving fold recognition without folds. , 2004, Journal of molecular biology.

[27]  Jason Weston,et al.  Protein ranking: from local to global structure in the protein similarity network. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[28]  John D. Westbrook,et al.  TargetDB: a target registration database for structural genomics projects , 2004, Bioinform..

[29]  M. Sternberg,et al.  Enhanced genome annotation using structural profiles in the program 3D-PSSM. , 2000, Journal of molecular biology.

[30]  Alexander J. Smola,et al.  Advances in Large Margin Classifiers , 2000 .

[31]  Li Liao,et al.  Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships , 2003, J. Comput. Biol..

[32]  S. Karlin,et al.  Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[33]  Sung-Hou Kim Shining a light on structural genomics , 1998, Nature Structural Biology.

[34]  Y Xu,et al.  Protein threading using PROSPECT: Design and evaluation , 2000, Proteins.

[35]  Dong Xu,et al.  PROSPECT II: protein structure prediction program for genome-scale applications. , 2003, Protein engineering.

[36]  Arne Elofsson,et al.  Using evolutionary information for the query and target improves fold recognition , 2004, Proteins.

[37]  D. Eisenberg,et al.  A method to identify protein sequences that fold into a known three-dimensional structure. , 1991, Science.

[38]  Philip E. Bourne The status of structural genomics , 2003 .

[39]  N. Grishin,et al.  COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. , 2003, Journal of molecular biology.

[40]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[41]  Mong-Li Lee,et al.  Efficient remote homology detection using local structure , 2003, Bioinform..