Universal Architectural Concepts Underlying Protein Folding Patterns

What is the architectural ‘basis set’ of the observed universe of protein structures? Using information-theoretic inference, we answer this question with a comprehensive dictionary of 1,493 substructural concepts. Each concept represents a topologically conserved assembly of helices and strands that are in contact with one another. Any protein structure can be dissected into instances of concepts from this dictionary. We dissected the entire worldwide Protein Data Bank and compiled a complete inventory of all concept instances. This inventory is an unprecedented source of biological insights, including: correlations between concepts and catalytic activities or binding sites, useful for rational drug design; local amino-acid sequence–structure correlations, useful for ab initio structure prediction methods; and information supporting the recognition and exploration of evolutionary relationships, useful for structural studies. An interactive site, Proçodic, at http://lcb.infotech.monash.edu.au/prosodic provides access to and navigation of the entire dictionary of concepts and all associated information.
