Efficient and Automated Analysis of Protein Structures

In recent years, computational complexity in structural bioinformatics has reached a new level with the vast increase in the amount of structural data available. The Protein Data Bank (PDB), the single worldwide repository for 3-D macromolecular structure data, contained more than 25,000 structures as of July 2004. Existing methods for protein structure analysis are unable to cope with this growth, so computationally efficient methods are needed for analyzing large numbers of protein structures and their associated functions. In this dissertation, we present methods for protein structure analysis that scale well with the amount of protein structure data available. Our work falls into three main categories: (1) visualization and surface modelling, (2) structure comparison and similarity search, and (3) automated classification. For efficient visualization of protein structures using a scene-graph based graphics API, we have developed methods that optimize the constructed scene graph to enable real-time visualization of very large protein complexes. Our method (FPV) achieves interactive rendering up to 8 times faster than existing methods. For the generation of molecular surfaces, we recently developed a method based on a level set formulation that computes the surface and the inaccessible interior cavities very efficiently (on average 1.5 to 3.14 times faster than the methods it was compared against). For comparison and similarity search of protein structures, we have developed a method that uses local shape signatures based on the theory of differential geometry. Our method (CTSS) is up to 30 times faster than CE, a widely used method for structure comparison, while achieving a similar level of accuracy.
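As a rough illustration of what a local shape signature from differential geometry can look like, the sketch below treats a protein backbone as a sequence of 3-D points and computes a discrete curvature and torsion at each interior point. The abstract does not state which invariants CTSS uses or how they are computed, so the choice of curvature and torsion, the finite-difference formulas, and every function name here are assumptions made for illustration only.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    return math.sqrt(_dot(a, a))

def curvature(p0, p1, p2):
    """Menger curvature: inverse radius of the circle through three points."""
    a = _norm(_sub(p1, p0))
    b = _norm(_sub(p2, p1))
    c = _norm(_sub(p2, p0))
    twice_area = _norm(_cross(_sub(p1, p0), _sub(p2, p0)))
    return 2.0 * twice_area / (a * b * c)

def torsion(p0, p1, p2, p3):
    """Signed dihedral angle of four consecutive points per unit length."""
    u1, u2, u3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(u1, u2), _cross(u2, u3)
    length = _norm(u2)
    angle = math.atan2(_dot(_cross(n1, n2), u2) / length, _dot(n1, n2))
    return angle / length

def shape_signature(points):
    """Hypothetical signature: one (curvature, torsion) pair per interior point."""
    return [
        (curvature(points[i - 1], points[i], points[i + 1]),
         torsion(points[i - 1], points[i], points[i + 1], points[i + 2]))
        for i in range(1, len(points) - 2)
    ]

# Usage: a finely sampled helix (cos t, sin t, t) has curvature 0.5 and torsion 0.5.
helix = [(math.cos(0.01 * k), math.sin(0.01 * k), 0.01 * k) for k in range(50)]
signature = shape_signature(helix)
```

Because curvature and torsion do not change under rotation or translation, signatures of this kind can be compared between two structures without superimposing them first, which is one plausible reason a signature-based approach can be fast.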
We have also developed an integrated sequence and structure analysis method (ProGreSS), which enables biologists to perform joint sequence and structure similarity queries while improving on the accuracy and efficiency of existing methods. To maintain an up-to-date view of the protein structure universe through automated classification, we have developed a machine-learning ensemble classifier based on decision trees. We show that higher classification accuracy can be achieved using the joint hypothesis of the ensemble classifier.
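The idea of a joint hypothesis can be sketched with a simple majority vote over the predictions of several component classifiers. The abstract does not say how the ensemble combines its members, so majority voting, the stub classifiers, the feature names, and the class labels below are all hypothetical stand-ins rather than the dissertation's method.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label; ties go to the earliest such label."""
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label

def ensemble_predict(classifiers, sample):
    """Joint hypothesis: ask every component classifier, then vote."""
    return majority_vote([clf(sample) for clf in classifiers])

# Hypothetical component classifiers, each a function from features to a label.
def by_helix_content(sample):
    return "all-alpha" if sample["helix_fraction"] > 0.5 else "all-beta"

def by_strand_content(sample):
    return "all-beta" if sample["strand_fraction"] > 0.5 else "all-alpha"

def by_length(sample):
    return "all-alpha" if sample["length"] < 150 else "all-beta"

classifiers = [by_helix_content, by_strand_content, by_length]
sample = {"helix_fraction": 0.7, "strand_fraction": 0.1, "length": 200}
label = ensemble_predict(classifiers, sample)
```

In this toy run two of the three stubs agree, so the vote overrides the single dissenting classifier; an ensemble built from decision trees would replace the stubs with trained trees.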
