Towards Solving the Inverse Protein Folding Problem

Accurately assigning folds to divergent protein sequences remains a major obstacle to structural studies and underlies the inverse protein folding problem. Herein, we outline our theories for fold recognition in the "twilight zone" of sequence similarity (<25% identity). Our analyses demonstrate that structural sequence profiles built using Position-Specific Scoring Matrices (PSSMs) significantly outperform multiple popular homology-modeling algorithms at relating and predicting structures from amino acid sequence alone. Importantly, structural sequence profiles reconstitute SCOP fold classifications in both control and test datasets. Results from our experiments suggest that structural sequence profiles can be used to rapidly annotate protein folds at proteomic scales. We propose that encoding the entire Protein Data Bank (~1070 folds) into structural sequence profiles would extract interoperable information capable of improving most, if not all, methods of structural modeling.
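To make the idea of PSSM-based fold assignment concrete, the following is a minimal sketch, not the authors' pipeline. It assumes gap-free toy alignments, a uniform background distribution, and simple additive pseudocounts; the function names (`build_pssm`, `best_match`, `assign_fold`) and the fold labels are hypothetical and chosen only for illustration. Real profiles, such as those produced by PSI-BLAST, use substitution-matrix-derived pseudocounts and sequence weighting.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Build a log-odds PSSM from equal-length, gap-free aligned sequences.

    Assumption: uniform background frequency (1/20) and additive pseudocounts.
    """
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in aligned_seqs)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def score_window(pssm, window):
    """Sum per-position log-odds scores; unknown residues contribute 0."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))


def best_match(pssm, query):
    """Slide the PSSM along the query; return (best score, offset)."""
    width = len(pssm)
    if len(query) < width:
        raise ValueError("query is shorter than the profile")
    offsets = range(len(query) - width + 1)
    best = max(offsets, key=lambda i: score_window(pssm, query[i:i + width]))
    return score_window(pssm, query[best:best + width]), best


def assign_fold(profiles, query):
    """Return the label of the profile with the highest best-window score."""
    return max(profiles, key=lambda fold: best_match(profiles[fold], query)[0])


# Hypothetical toy profiles standing in for per-fold structural profiles.
profiles = {
    "fold_a": build_pssm(["ACDE", "ACDE", "ACDF"]),
    "fold_b": build_pssm(["WWYY", "WWYH", "WWYY"]),
}
query = "GGACDEGG"
print(assign_fold(profiles, query), best_match(profiles["fold_a"], query))
```

The design choice here is an ungapped sliding-window score, which keeps the sketch short; a profile search at proteomic scale would use gapped alignment and statistical significance estimates rather than raw scores.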
