Predicting Protein Folds with Fold-Specific PSSM Libraries

Accurately assigning folds for divergent protein sequences is a major obstacle to structural studies. Herein, we outline an effective method for fold recognition using sets of PSSMs, each of which is constructed for different protein folds. Our analyses demonstrate that FSL (Fold-specific Position Specific Scoring Matrix Libraries) can predict/relate structures given only their amino acid sequences of highly divergent proteins. This ability to detect distant relationships is dependent on low-identity sequence alignments obtained from FSL. Results from our experiments demonstrate that FSL perform well in recognizing folds from the “twilight-zone” SABmark dataset. Further, this method is capable of accurate fold prediction in newly determined structures. We suggest that by building complete PSSM libraries for all unique folds within the Protein Database (PDB), FSL can be used to rapidly and reliably annotate a large subset of protein folds at proteomic level. The related programs and fold-specific PSSMs for our FSL are publicly available at: http://ccp.psu.edu/download/FSLv1.0/.

[1]  John B. Anderson,et al.  CDD: a Conserved Domain Database for protein classification , 2004, Nucleic Acids Res..

[2]  Dongwon Lee,et al.  Adaptive BLASTing through the Sequence Dataspace: Theories on Protein Sequence Embedding , 2009, 0911.0650.

[3]  N. Grishin Fold change in evolution of protein structures. , 2001, Journal of structural biology.

[4]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[5]  Christus,et al.  A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins , 2022 .

[6]  Andrew P. Bradley,et al.  The use of the area under the ROC curve in the evaluation of machine learning algorithms , 1997, Pattern Recognit..

[7]  Golan Yona,et al.  Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. , 2002, Journal of molecular biology.

[8]  C. Chothia Proteins. One thousand families for the molecular biologist. , 1992, Nature.

[9]  S. Henikoff,et al.  Embedding strategies for effective use of information from multiple sequence alignments , 1997, Protein science : a publication of the Protein Society.

[10]  Department of Materials Science,et al.  Phylogenetic Profiles as a Unified Framework for Measuring Protein Structure, Function and Evolution , 2008, 0806.2394.

[11]  Ming Li,et al.  Assessment of RAPTOR's linear programming approach in CAFASP3 , 2003, Proteins.

[12]  Ben M. Webb,et al.  Comparative Protein Structure Modeling Using MODELLER , 2007, Current protocols in protein science.

[13]  Leszek Rychlewski,et al.  FFAS03: a server for profile–profile sequence alignments , 2005, Nucleic Acids Res..

[14]  Ben M. Webb,et al.  Comparative Protein Structure Modeling Using Modeller , 2006, Current protocols in bioinformatics.

[16]  Jianlin Cheng,et al.  Prediction of global and local quality of CASP8 models by MULTICOM series , 2009, Proteins.

[17]  Richard Hughey,et al.  SAM‐T04: What is new in protein–structure prediction for CASP6 , 2005, Proteins.

[18]  Lode Wyns,et al.  SABmark- a benchmark for sequence alignment that covers the entire known fold space , 2005, Bioinform..

[19]  G. Bhardwaj,et al.  Brainstorming through the Sequence Universe: Theories on the Protein Problem , 2009, 0911.0652.

[20]  Alejandro A. Schäffer,et al.  IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices , 1999, Bioinform..

[21]  Robert C. Edgar,et al.  MUSCLE: multiple sequence alignment with high accuracy and high throughput. , 2004, Nucleic acids research.

[22]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[23]  D. Baker,et al.  A surprising simplicity to protein folding , 2000, Nature.

[24]  C. Chothia One thousand families for the molecular biologist , 1992, Nature.