FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB quickly and accurately

Fast identification of protein structures that are similar to a specified query structure in the entire Protein Data Bank (PDB) is fundamental in structure and function prediction. We present FragBag: an ultrafast and accurate method for comparing protein structures. We describe a protein structure by the collection of its overlapping short contiguous backbone segments, and discretize this set using a library of fragments. Then, we succinctly represent the protein as a “bag-of-fragments” (a vector that counts the number of occurrences of each fragment) and measure the similarity between two structures by the similarity between their vectors. Our representation has two additional benefits: (i) it can be used to construct an inverted index for implementing a fast structural search engine of the entire PDB, and (ii) one can specify a structure as a collection of substructures, without combining them into a single structure; this is valuable for structure prediction, when reliable predictions exist only for parts of the protein. We use receiver operating characteristic curve analysis to quantify the success of FragBag in identifying neighbor candidate sets in a dataset of over 2,900 structures. The gold standard is the set of neighbors found by six state-of-the-art structural aligners. Our best FragBag library finds more accurate candidate sets than the three other filter methods: the SGM, PRIDE, and a method by Zotenko et al. More interestingly, FragBag performs on a par with the computationally expensive, yet highly trusted, structural aligners STRUCTAL and CE.
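The representation described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the FragBag implementation: the abstract does not specify how a backbone segment is matched to a library fragment or which vector similarity is used, so the sketch assumes matching by internal pairwise distances (a simplification that ignores chirality) and cosine similarity. The function names, the two-fragment library, and the three toy backbones are hypothetical.

```python
import math
from collections import Counter, defaultdict


def distance_signature(points):
    """Pairwise distances (i < j) of a segment; unchanged by rotation and translation."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]


def nearest_fragment(window, library_signatures):
    """Index of the library fragment whose signature is closest to the window's.
    Assumption: squared difference of internal distances stands in for a real
    structural distance between the segment and the fragment."""
    sig = distance_signature(window)
    return min(
        range(len(library_signatures)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(sig, library_signatures[i])),
    )


def bag_of_fragments(backbone, library, k):
    """Count how often each library fragment best matches an overlapping k-residue window."""
    lib_sigs = [distance_signature(f) for f in library]
    counts = Counter(
        nearest_fragment(backbone[i:i + k], lib_sigs)
        for i in range(len(backbone) - k + 1)
    )
    return [counts.get(i, 0) for i in range(len(library))]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_inverted_index(bags):
    """Map each fragment id to the set of structures that contain it at least once."""
    index = defaultdict(set)
    for name, bag in bags.items():
        for frag_id, count in enumerate(bag):
            if count:
                index[frag_id].add(name)
    return index


# Hypothetical library of two 3-residue fragments: one straight, one bent.
LIBRARY = [
    [(0, 0, 0), (1, 0, 0), (2, 0, 0)],
    [(0, 0, 0), (1, 0, 0), (1, 1, 0)],
]

# Hypothetical toy backbones (one point per residue).
STRUCTURES = {
    "straight": [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)],
    "stairs": [(0, 0, 0), (1, 0, 0), (1, 1, 0), (2, 1, 0), (2, 2, 0)],
    "mixed": [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (3, 1, 0)],
}

bags = {name: bag_of_fragments(b, LIBRARY, k=3) for name, b in STRUCTURES.items()}
index = build_inverted_index(bags)

for name, bag in bags.items():
    print(name, bag, round(cosine_similarity(bags["straight"], bag), 3))
```

In a search setting, the inverted index would restrict the comparison to structures that share at least one fragment with the query, and only those candidates would be ranked by vector similarity. A query given as several disconnected substructures could be handled in the same way by summing the count vectors of its parts, since no window ever needs to span two parts.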
