Data Mining of Macromolecular Structures.

Macromolecular structures are used widely, in applications ranging from teaching the principles of protein structure to ligand optimization in drug development. Applying data mining techniques to these experimentally determined structures requires a highly uniform, standardized source of structural data. The Protein Data Bank (PDB) has evolved over the years into the standard resource for macromolecular structures. However, the process of selecting the data most suitable for a specific application still depends largely on personal preference and on understanding of the experimental techniques used to obtain these models. In this chapter, we first explain the challenges with data standardization, annotation, and uniformity in PDB entries determined by X-ray crystallography. We then discuss the specific effects that crystallographic data quality and model optimization methods have on structural models, and how validation tools can be used to make informed choices. We also discuss the specific advantages of using the PDB_REDO databank as a resource for structural data. Finally, we provide guidelines on how to select the most suitable protein structure models for detailed analysis and how to select a set of structure models suitable for data mining.
