Data-Driven High-Throughput Prediction of the 3-D Structure of Small Molecules: Review and Progress

Accurate prediction of the 3-D structure of small molecules is essential for understanding their physical, chemical, and biological properties, including how they interact with other molecules. Here, we survey the field of high-throughput methods for 3-D structure prediction and set new target specifications for the next generation of methods. We then introduce COSMOS, a novel data-driven prediction method that utilizes libraries of fragment and torsion angle parameters. We illustrate COSMOS using parameters extracted from the Cambridge Structural Database (CSD), analyzing their distribution and then evaluating the system's performance in terms of speed, coverage, and accuracy. Results show that COSMOS represents a significant improvement over state-of-the-art prediction methods, particularly in coverage of complex molecular structures, including metal-organics. COSMOS can predict structures for 96.4% of the molecules in the CSD (99.6% organic, 94.6% metal-organic), whereas the widely used commercial method CORINA predicts structures for 68.5% (98.5% organic, 51.6% metal-organic). On the common subset of molecules predicted by both methods, COSMOS makes predictions with an average time per molecule of 0.15 s (0.10 s organic, 0.21 s metal-organic) and an average root-mean-square deviation (RMSD) of 1.57 Å (1.26 Å organic, 1.90 Å metal-organic), while CORINA makes predictions with an average time per molecule of 0.13 s (0.18 s organic, 0.08 s metal-organic) and an average RMSD of 1.60 Å (1.13 Å organic, 2.11 Å metal-organic). COSMOS is available through the ChemDB chemoinformatics Web portal at http://cdb.ics.uci.edu/.
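For readers unfamiliar with the accuracy metric quoted above, the following is a minimal sketch of a root-mean-square deviation between a predicted and a reference structure. It is an illustration only, not the evaluation code behind the reported figures: the function name `rmsd` is hypothetical, atoms are assumed to be listed in the same order in both structures, and the two coordinate sets are assumed to be already superposed (no rotational or translational alignment is performed here).

```python
import math

def rmsd(predicted, reference):
    """RMSD between two matched lists of (x, y, z) coordinates.

    Assumption: both structures are already superposed and list their
    atoms in the same order. The result has the units of the input
    (e.g. angstroms).
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (px, py, pz), (rx, ry, rz) in zip(predicted, reference):
        # Squared Euclidean distance between corresponding atoms.
        total += (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
    return math.sqrt(total / len(predicted))
```

In practice, a comparison against crystal structures would normally superpose the two structures first, so the value from this sketch is an upper bound on the aligned deviation.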
