Fast and accurate protein substructure searching with simulated annealing and GPUs

Background: Searching a database of protein structures for matches to a query structure, or for occurrences of a structural motif, is an important task in structural biology and bioinformatics. While there are many existing methods for structural similarity searching, faster and more accurate approaches are still required, and few current methods are capable of substructure (motif) searching.

Results: We developed an improved heuristic for tableau-based protein structure and substructure searching using simulated annealing that is as fast as or faster than, and comparable in accuracy to, some widely used existing methods. Furthermore, we created a parallel implementation on a modern graphics processing unit (GPU).

Conclusions: The GPU implementation achieves a speedup of up to 34 times over the CPU implementation of tableau-based structure search with simulated annealing, making it one of the fastest available methods. To the best of our knowledge, this is the first application of a GPU to the protein structural search problem.
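
To make the idea of simulated annealing over a tableau match more concrete, the following is a minimal sketch and not the heuristic described in the paper. It assumes a simplified tableau (a symmetric matrix of orientation codes between secondary structure elements) and a toy score that counts matched pairs whose codes agree. The function names `match_score` and `anneal`, the cooling schedule, and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def match_score(query, target, assign):
    """Toy score: number of matched element pairs whose orientation codes agree.

    query and target are square matrices of codes (simplified tableaux);
    assign[i] is the target element matched to query element i, or None.
    """
    matched = [(i, a) for i, a in enumerate(assign) if a is not None]
    score = 0
    for x in range(len(matched)):
        for y in range(x + 1, len(matched)):
            (i, a), (j, b) = matched[x], matched[y]
            if query[i][j] == target[a][b]:
                score += 1
    return score


def anneal(query, target, steps=20000, t_start=2.0, t_end=0.01, seed=0):
    """Search for a one-to-one partial matching that maximises match_score."""
    rng = random.Random(seed)
    n, m = len(query), len(target)
    assign = [None] * n          # start with every query element unmatched
    current = best = 0
    best_assign = assign[:]
    for step in range(steps):
        # Geometric cooling from t_start towards t_end (an assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / steps)
        # Move: re-match one query element to an unused target element or to None.
        i = rng.randrange(n)
        used = {a for k, a in enumerate(assign) if a is not None and k != i}
        options = [None] + [a for a in range(m) if a not in used]
        old = assign[i]
        assign[i] = rng.choice(options)
        candidate = match_score(query, target, assign)
        delta = candidate - current
        # Metropolis criterion: always accept improvements, sometimes accept worse.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = candidate
            if current > best:
                best, best_assign = current, assign[:]
        else:
            assign[i] = old      # reject the move
    return best, best_assign
```

Calling `anneal(query, target)` with a small query tableau and a larger target tableau returns the best score found and the corresponding matching. Because the query may be smaller than the target and elements may stay unmatched, the same loop covers both whole-structure and substructure (motif) matching in this simplified setting.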
