In Silico Target Profiling of One Billion Molecules

Small molecules are essential for the functioning of biological systems, and there is thus increasing interest in using chemistry to probe biology. Challenges in this area include not only expanding the currently known limits of chemical space but also recognizing which regions of it may be populated by biologically active molecules. Historical worldwide synthetic efforts have collectively produced a commercially available catalogue that at present contains around eight million molecules, and the recent coordinated use of high-throughput screening technology across multiple academic centers has generated, and made publicly available, screening data for chemicals on hundreds of biological assays. However, at this pace of synthesis and testing, one can at most aspire to cover a minute portion of the vastness of chemical space and to provide just a glimpse of the bioactivity landscape associated with it. Until complete experimental screening of tens of millions of molecules on thousands of protein targets becomes a viable option, progress in this direction should come from efficient guidance by novel computational approaches that combine large-scale compound enumeration with in silico screening. In this respect, the recent construction of GDB-13 offers access to almost one billion organic small molecules containing up to 13 atoms, constituting the largest public repository of virtual molecules available to date. In addition, initiatives to collect and properly store bioactivity data reported in the scientific literature in public chemogenomic databases have recently promoted the development of ligand-based in silico screening methods that allow fast and efficient processing of small molecules against multiple protein targets. The identification of new targets for some old drugs is a highly visible example of the potential of these methodologies in drug discovery.
We report here the results of processing all 977 468 314 molecules present in GDB-13 against ligand-based models derived for 4500 protein targets, which makes it the largest in silico screening campaign ever attempted. Ligand-based in silico screening relies on the assumption that the set of ligands with known bioactivity data for a given protein target (the reference molecules) provides a complementary description of the target from a ligand perspective. To process chemical information rapidly and efficiently, all molecular structures need to be encoded with mathematical descriptors. In this case, we used the low-dimensional Shannon entropy descriptors (SHED), a set of just ten real numbers that capture the variability of all feature-pair distributions derived directly from the topology of the molecule. A total of 2700 CPU hours were needed to compute SHED for all compounds in GDB-13. The bioactivity of each virtual molecule from GDB-13 for a given target is then estimated by inverse SHED distance weighting interpolation of the bioactivity landscape defined by all neighboring reference molecules. Processing all compounds in GDB-13 against the 4500 ligand-based target models required 76 000 CPU hours. All computations were performed on a 96-CPU Linux cluster. The results reveal that as much as 45.8 % of GDB-13 lies outside the current applicability domain of the method, meaning that for 448 004 750 molecules not a single molecule with known bioactivity data could be found within the prevalidated SHED distance cut-off of 0.52, and thus no bioactivity prediction could be made. This is not a surprising outcome considering that GDB-13 contains only molecules of up to 13 atoms and that molecules of this size are largely underrepresented in public chemogenomic databases.
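The interpolation step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Euclidean metric between descriptor vectors, the weighting exponent `power`, the handling of zero distances, and the names `shed_distance` and `predict_pact` are all assumptions made for illustration. Only the ten-number descriptor length and the 0.52 cut-off come from the text.

```python
import math
from typing import List, Optional, Sequence, Tuple

SHED_CUTOFF = 0.52  # distance cut-off quoted in the text

Reference = Tuple[Sequence[float], float]  # (descriptor vector, known pAct)


def shed_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance between two descriptor vectors (Euclidean metric is an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_pact(
    query: Sequence[float],
    references: List[Reference],
    cutoff: float = SHED_CUTOFF,
    power: float = 1.0,  # hypothetical weighting exponent
) -> Optional[float]:
    """Inverse-distance-weighted estimate of pAct for one target.

    Returns None when no reference molecule lies within the cut-off,
    i.e. the query is outside the applicability domain.
    """
    neighbours = []
    for descriptor, pact in references:
        d = shed_distance(query, descriptor)
        if d <= cutoff:
            neighbours.append((d, pact))
    if not neighbours:
        return None

    # A reference at zero distance would get infinite weight, so its
    # known value (averaged if there are several) is returned directly.
    exact = [pact for d, pact in neighbours if d == 0.0]
    if exact:
        return sum(exact) / len(exact)

    weights = [1.0 / d ** power for d, _ in neighbours]
    weighted_sum = sum(w * pact for w, (_, pact) in zip(weights, neighbours))
    return weighted_sum / sum(weights)


# Toy data: three made-up reference molecules for a single target.
refs: List[Reference] = [
    ([0.1] + [0.0] * 9, 6.0),       # distance 0.1 from the origin
    ([0.0, 0.4] + [0.0] * 8, 8.0),  # distance 0.4 from the origin
    ([1.0] * 10, 9.0),              # far away, beyond the cut-off
]
print(predict_pact([0.0] * 10, refs))  # dominated by the closest neighbour
print(predict_pact([5.0] * 10, refs))  # no neighbours within the cut-off
```

In the toy data, the query at the origin has two neighbors within the cut-off, and the closer one pulls the estimate to roughly 6.4. The second query has no neighbors and yields `None`, which mirrors the 45.8 % of GDB-13 for which no prediction could be made.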
Of the remaining 54.2 % of GDB-13 for which at least one bioactivity prediction can be made, compounds amounting to 10.0 % of GDB-13 had predicted bioactivity values for every target above 10 µM (pAct < 5), and compounds amounting to 24.8 % of GDB-13 had a predicted bioactivity for at least one target below 0.1 µM (pAct > 7). These two sets, composed of exactly 97 795 843 and 242 060 255 molecules, respectively, constitute the predicted inactive and active sets of GDB-13 and, in the remainder of this communication, are referred to as iGDB-13 and aGDB-13. The vast majority of molecules in GDB-13 are relatively similar in size and hydrophobicity, with values of molecular weight (MW) and clogP in the narrow ranges of [175,185] and [0.0,1.0], respectively. Accordingly, the set of predicted bioactives distinguishes itself from the set of predicted inactives only by the presence and ar-
