Comparison of structure‐based and threading‐based approaches to protein functional annotation

To exploit the vast amount of sequence information produced by the genomic revolution, the biological function of these sequences must be identified. In practice, this is often accomplished by functional inference. Purely sequence‐based approaches are complicated by many factors, particularly in the “twilight zone” of low sequence similarity. For proteins, structure‐based techniques aim to overcome these problems; however, most require high‐quality crystal structures and suffer from the complex and equivocal relationship between protein fold and function. In this study, we use extensive benchmarking to examine several aspects of structure‐based functional annotation: binding pocket detection, molecular function assignment, and ligand‐based virtual screening. We demonstrate that protein threading driven by a strong sequence profile component greatly improves the quality of purely structure‐based functional annotation in the “twilight zone.” By detecting evolutionarily related proteins, threading considerably reduces the high false positive rate of function inference based on global structure similarity alone. Combined evolution/structure‐based function assignment emerges as a powerful technique that can make a significant contribution to comprehensive proteome annotation. Proteins 2010. © 2009 Wiley‐Liss, Inc.
