Paper information - Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm

This article describes the PROSPECTOR_3 threading algorithm, which combines various scoring functions designed to match structurally related target/template pairs. Each variant described was found to have two Z-score thresholds: Zstruct, above which most identified templates have good structural alignments, and a higher threshold, Zgood, above which most have good threading alignments. ‘Easy’ targets with accurate threading alignments are identified as single templates with Z > Zgood or two templates, each with Z > Zstruct, having a good consensus structure in mutually aligned regions. ‘Medium’ targets have a pair of templates lacking a consensus structure, or a single template for which Zstruct < Z < Zgood. PROSPECTOR_3 was applied to a comprehensive Protein Data Bank (PDB) benchmark composed of 1491 single domain proteins, 41–200 residues long and no more than 30% identical to any threading template. Of the proteins, 878 were found to be easy targets, with 761 having a root mean square deviation (RMSD) from native of less than 6.5 Å. The average contact prediction accuracy was 46%, and on average, continuous fragments of 17.6 residues were predicted with RMSD values of 2.0 Å. There were 606 medium targets identified, 87% of which had good structural alignments and 31% of which had good threading alignments. On average, continuous fragments of 9.1 residues with RMSD of 2.5 Å were predicted. Combining easy and medium sets, 63% of the targets had good threading alignments and 91% had good structural alignments compared to native; the average target/template sequence identity was 22%. Only nine targets lacked matched templates. Moreover, PROSPECTOR_3 consistently outperforms PSIBLAST. Similar results were predicted for open reading frames (ORFs) ≤200 residues in the M. genitalium, E. coli and S. cerevisiae genomes. Thus, progress has been made in identification of weakly homologous/analogous proteins, with very high alignment coverage, both in a comprehensive PDB benchmark and in genomes. Proteins 2004;55:000–000. © 2004 Wiley-Liss, Inc.
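
The easy/medium rule stated in the abstract can be sketched in a few lines of Python. This is only an illustration of the decision logic as the abstract words it, not the authors' implementation. The names `TemplateHit` and `classify_target` are hypothetical, and so is the "unmatched" label. The threshold values in the usage lines are invented, since the abstract gives no numbers for Zstruct or Zgood. The consensus check is passed in as a precomputed boolean because the abstract does not say how consensus is measured. Treating any hit above Zgood as sufficient, and "two templates" as "two or more", are also assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TemplateHit:
    """One threading hit (hypothetical type): template identifier and Z-score."""
    template_id: str
    z_score: float


def classify_target(hits: Sequence[TemplateHit], z_struct: float, z_good: float,
                    has_consensus: bool) -> str:
    """Label a target 'easy', 'medium', or 'unmatched' (assumed label).

    has_consensus is an assumed, precomputed flag saying whether the
    significant templates agree structurally in mutually aligned regions.
    """
    if z_struct >= z_good:
        raise ValueError("expected z_struct < z_good")

    # Only hits above the lower threshold count as matched templates.
    significant = [h for h in hits if h.z_score > z_struct]
    if not significant:
        return "unmatched"

    # A template with Z > Zgood makes the target easy.
    if any(h.z_score > z_good for h in significant):
        return "easy"

    # Two or more templates with Zstruct < Z <= Zgood: easy only with consensus.
    if len(significant) >= 2:
        return "easy" if has_consensus else "medium"

    # A single template between the two thresholds.
    return "medium"


# Illustrative thresholds only; the abstract does not state the real values.
Z_STRUCT, Z_GOOD = 5.0, 10.0
print(classify_target([TemplateHit("tmplA", 12.0)], Z_STRUCT, Z_GOOD, False))  # easy
print(classify_target([TemplateHit("tmplA", 7.0)], Z_STRUCT, Z_GOOD, False))   # medium
```

Passing consensus as a flag keeps the sketch independent of any particular structural comparison measure.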

J. Skolnick | D. Kihara | Yang Zhang
