Learning from the Data: Mining of Large High-Throughput Screening Databases

High-throughput screening (HTS) campaigns in pharmaceutical companies have accumulated a large amount of data for several million compounds across a couple of hundred assays. Despite general awareness that rich information is hidden in this data, little has been reported on systematic data mining methods that can reliably extract knowledge relevant to chemists and biologists. We developed a data mining approach based on an algorithm called ontology-based pattern identification (OPI) and applied it to our in-house HTS database. We identified nearly 1500 scaffold families with statistically significant structure-HTS activity profile relationships. Among them, dozens of scaffolds were characterized as producing artifactual results that stem from the screening technology employed, such as the assay format and/or readout. Four types of compound scaffolds can be characterized based on this data mining effort: tumor cytotoxic, general toxic, potential reporter gene assay artifact, and target-family-specific. The OPI-based data mining approach can reliably identify compounds that are not only structurally similar but also share statistically significant biological activity profiles. Statistical tests such as the Kruskal-Wallis test and analysis of variance (ANOVA) can then be applied to the discovered scaffolds to assign relevant biological information effectively. The scaffolds identified by our HTS data mining efforts are an invaluable resource for designing SAR-robust diversity libraries, generating in silico biological annotations of compounds on a scaffold basis, and providing novel target-family-specific scaffolds for focused compound library design.
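As a rough illustration of how a Kruskal-Wallis test might be applied to one scaffold, the sketch below computes the tie-corrected H statistic for a scaffold's activity values grouped by assay category. The abstract does not say how the tests were set up, so the grouping by assay category, the function names, and the activity values are all assumptions made for illustration only.

```python
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a dict of label -> list of values."""
    pooled = [v for vals in groups.values() for v in vals]
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of (rank sum squared / group size) over the groups.
    total = 0.0
    start = 0
    for vals in groups.values():
        size = len(vals)
        rank_sum = sum(ranks[start:start + size])
        total += rank_sum ** 2 / size
        start += size
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    # Standard correction for tied observations.
    tie_counts = Counter(pooled).values()
    correction = 1 - sum(c ** 3 - c for c in tie_counts) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


# Hypothetical activity values for one scaffold, grouped by an assumed
# assay category (not data from the paper).
scaffold_activity = {
    "reporter_gene": [0.82, 0.91, 0.77, 0.88],
    "enzymatic": [0.12, 0.20, 0.15, 0.09],
    "binding": [0.18, 0.25, 0.11, 0.22],
}
h_stat = kruskal_wallis_h(scaffold_activity)
print(f"H = {h_stat:.3f} with {len(scaffold_activity) - 1} degrees of freedom")
```

A large H relative to the chi-square distribution with (number of groups minus 1) degrees of freedom would suggest that the scaffold's activity differs between assay categories. With many scaffolds tested, the resulting p-values would also need a multiple-testing correction.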