Computational Protein Design: Validation and Possible Relevance as a Tool for Homology Searching and Fold Recognition

Background

Protein fold recognition usually relies on a statistical model of each fold; each model is constructed from an ensemble of natural sequences belonging to that fold. A complementary strategy may be to employ sequence ensembles produced by computational protein design. Designed sequences can be more diverse than natural sequences, possibly avoiding some limitations of experimental databases.

Methodology/Principal Findings

We explore this strategy for four SCOP families: Small Kunitz-type inhibitors (SKIs), Interleukin-8 chemokines, PDZ domains, and large Caspase catalytic subunits, represented by 43 structures. An automated procedure is used to redesign the 43 proteins. We use the experimental backbones as fixed templates for the folded state and a molecular mechanics model to compute the interaction energies between sidechain and backbone groups. Calculations are done with the Proteins@Home volunteer computing platform. A heuristic algorithm is used to scan the sequence and conformational space, yielding 200,000–300,000 sequences per backbone template. The results confirm and generalize our earlier study of SH2 and SH3 domains. The designed sequences resemble moderately distant natural homologues of the initial templates; for example, the SUPERFAMILY library of profile hidden Markov models recognizes 85% of the low-energy sequences as native-like. Conversely, Position Specific Scoring Matrices derived from the designed sequences can be used to detect natural homologues within the SwissProt database: 60% of known PDZ domains and around 90% of known SKIs and chemokines are detected. Energy components and inter-residue correlations are analyzed, and ways to improve the method are discussed.

Conclusions/Significance

For some families, designed sequences can be a useful complement to experimental ones for homologue searching. However, improved tools are needed to extract more information from the designed profiles before the method can be of general use.
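As an illustration of how a Position Specific Scoring Matrix can be derived from an ensemble of designed sequences and then used to scan a natural sequence, the following minimal Python sketch may help. It is not the procedure used in the study: it assumes equal-length, ungapped designed sequences (as produced on a fixed backbone), a uniform background frequency, and a simple additive pseudocount. The function names `build_pssm`, `score_window`, and `best_match` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(sequences, pseudocount=1.0):
    """Build a log-odds PSSM (in bits), one dict of scores per position.

    Assumptions (illustrative only): all sequences have the same length,
    and the background frequency of every amino acid is uniform (1/20).
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")

    background = 1.0 / len(AMINO_ACIDS)
    total = len(sequences) + pseudocount * len(AMINO_ACIDS)

    pssm = []
    for position in range(length):
        counts = Counter(seq[position] for seq in sequences)
        column = {}
        for aa in AMINO_ACIDS:
            # Pseudocounts keep unseen residues from scoring minus infinity.
            freq = (counts.get(aa, 0) + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_window(pssm, segment):
    """Sum the per-position scores; unknown residues (e.g. 'X') score 0."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, segment))


def best_match(pssm, target):
    """Slide the PSSM along target; return (start, score) of the best window.

    Returns None if the target is shorter than the PSSM.
    """
    width = len(pssm)
    best = None
    for start in range(len(target) - width + 1):
        score = score_window(pssm, target[start:start + width])
        if best is None or score > best[1]:
            best = (start, score)
    return best


# Toy example: three "designed" sequences for a four-residue template.
designed = ["ACDK", "ACEK", "ACDR"]
pssm = build_pssm(designed)
hit = best_match(pssm, "GGACDKGG")
```

In practice, database searches of this kind rely on gapped alignment, empirically derived background frequencies, and statistical significance estimates, none of which this sketch attempts.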
