Visual Characterization and Diversity Quantification of Chemical Libraries: 1. Creation of Delimited Reference Chemical Subspaces

High-throughput screening (HTS) is a well-established technology that can test up to several million compounds in a few weeks. Despite these appealing capabilities, limited resources and high costs may restrict the number of molecules screened, making diversity analysis a method of choice for designing and prioritizing screening libraries. With a constantly increasing number of molecules available for screening, chemical space has become a key concept for visualizing, analyzing, and comparing chemical libraries. In this first article, we present a new method to build delimited reference chemical subspaces (DRCS). A set of 16 million screening compounds from 73 chemical providers was gathered, resulting in a database of 6.63 million standardized and unique molecules. These molecules were used to create three DRCS, each based on a different set of chemical descriptors. For each space, a robust principal component analysis model was obtained, with which molecules are projected into a reduced, viewable two-dimensional space. The distinctive feature of our approach is that each reduced space is delimited by a representative contour that encompasses a very large proportion of the molecules and reflects the overall shape of the space. The methodology is illustrated by mapping and comparing various chemical libraries. Several tools used in these studies are made freely available, enabling any user to compute DRCS matching specific requirements.
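
The projection and delimitation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses ordinary PCA on standardized descriptors in place of the robust PCA model, and it approximates the representative contour by the convex hull of the points closest to the centroid, which is an assumption made here for illustration (a convex hull cannot follow a non-convex shape). The function names, the `coverage` parameter, and its default value are hypothetical.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of a descriptor matrix onto the first two principal components."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0                      # avoid dividing by zero for constant descriptors
    Z = (X - mu) / sd                      # standardize each descriptor
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:2].T

def convex_hull(points):
    """Andrew's monotone chain algorithm; returns hull vertices counter-clockwise."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return np.array(pts)

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return np.array(lower[:-1] + upper[:-1])

def delimiting_contour(scores, coverage=0.98):
    """Hull of the `coverage` fraction of points nearest the centroid.

    Assumption: nearness is measured by squared Mahalanobis distance in the
    2D score space; the remaining points are treated as outliers.
    """
    d = scores - scores.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(scores, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", d, inv_cov, d)
    keep = d2 <= np.quantile(d2, coverage)
    return convex_hull(scores[keep]), keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    descriptors = rng.normal(size=(1000, 8))   # stand-in for real molecular descriptors
    scores = pca_2d(descriptors)
    hull, keep = delimiting_contour(scores)
    print(scores.shape, len(hull), keep.mean())
```

A library of interest could then be projected with the same centering, scaling, and axes as the reference set and plotted against the hull to see which regions of the reference subspace it covers.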
