Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm

This article describes the PROSPECTOR_3 threading algorithm, which combines various scoring functions designed to match structurally related target/template pairs. Each variant described was found to have a Z‐score above which most identified templates have good structural (threading) alignments, Zstruct (Zgood). ‘Easy’ targets with accurate threading alignments are identified as single templates with Z > Zgood or two templates, each with Z > Zstruct, having a good consensus structure in mutually aligned regions. ‘Medium’ targets have a pair of templates lacking a consensus structure, or a single template for which Zstruct < Z < Zgood. PROSPECTOR_3 was applied to a comprehensive Protein Data Bank (PDB) benchmark composed of 1491 single domain proteins, 41–200 residues long and no more than 30% identical to any threading template. Of the proteins, 878 were found to be easy targets, with 761 having a root mean square deviation (RMSD) from native of less than 6.5 Å. The average contact prediction accuracy was 46%, and on average, continuous fragments of 17.6 residues were predicted with RMSD values of 2.0 Å. There were 606 medium targets identified, 87% (31%) of which had good structural (threading) alignments. On average, continuous fragments of 9.1 residues were predicted with RMSD values of 2.5 Å. Combining easy and medium sets, 63% (91%) of the targets had good threading (structural) alignments compared to native; the average target/template sequence identity was 22%. Only nine targets lacked matched templates. Moreover, PROSPECTOR_3 consistently outperforms PSIBLAST. Similar results were predicted for open reading frames (ORFs) ≤200 residues in the M. genitalium, E. coli and S. cerevisiae genomes. Thus, progress has been made in identification of weakly homologous/analogous proteins, with very high alignment coverage, both in a comprehensive PDB benchmark and in genomes. Proteins 2004;55:000–000. © 2004 Wiley‐Liss, Inc.
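
The easy/medium assignment rules stated in the abstract can be written as a small decision function. The Python sketch below is illustrative only and is not the PROSPECTOR_3 implementation. The function name `classify_target`, its parameters, and the `"unassigned"` label are hypothetical. The abstract gives no numerical values for Zstruct or Zgood, so both are passed in by the caller, and the consensus-structure test is assumed to have been computed elsewhere and supplied as a boolean.

```python
from typing import Sequence


def classify_target(
    z_scores: Sequence[float],
    z_struct: float,
    z_good: float,
    has_consensus: bool,
) -> str:
    """Label a target 'easy', 'medium', or 'unassigned' (hypothetical label).

    z_scores      -- Z-scores of the templates identified for one target
    z_struct      -- threshold for good structural alignments (caller-supplied)
    z_good        -- threshold for good threading alignments (caller-supplied)
    has_consensus -- assumed precomputed: do the templates above z_struct
                     share a good consensus structure in aligned regions?
    """
    # The abstract's range Zstruct < Z < Zgood implies this ordering.
    if z_good < z_struct:
        raise ValueError("expected z_good >= z_struct")

    # Easy: at least one template scores above Zgood.
    if any(z > z_good for z in z_scores):
        return "easy"

    above_struct = [z for z in z_scores if z > z_struct]

    # Two or more templates above Zstruct: easy with consensus, else medium.
    if len(above_struct) >= 2:
        return "easy" if has_consensus else "medium"

    # A single template with Zstruct < Z <= Zgood: medium.
    if len(above_struct) == 1:
        return "medium"

    # No template clears Zstruct; the abstract does not name this case.
    return "unassigned"
```

The thresholds in any call to this function are placeholders chosen by the caller; the abstract states only that each scoring variant has its own Zstruct and Zgood.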
