A computational pipeline for protein structure prediction and analysis at genome scale

Traditionally, protein 3D structures are solved using experimental techniques such as X-ray crystallography and nuclear magnetic resonance (NMR). While these techniques have been the workhorses of protein structure studies over the past few decades, it is increasingly apparent that they alone cannot keep pace with the rate at which protein sequences are produced. Fortunately, computational techniques for protein structure prediction have matured to the point that they can complement the existing experimental techniques. In this paper, we present an automated pipeline for protein structure prediction. The centerpiece of the pipeline is PROSPECT, a threading-based protein structure prediction system that we have been developing for the past few years. The pipeline consists of seven logical phases and uses a dozen tools. It is implemented as a client/server system with a web interface and runs in a heterogeneous computational environment. A number of genome-scale applications have been carried out on microbial genomes. Here we present one genome-scale application, on Caenorhabditis elegans.
