Library Enhancement through the Wisdom of Crowds

We present a novel approach for enhancing the diversity of a chemical library, rooted in the theory of the wisdom of crowds. Our approach was motivated by a desire to tap into the collective experience of our global medicinal chemistry community and involved four basic steps: (1) Candidate compounds for acquisition were screened using various structural and property filters to eliminate clearly non-drug-like matter. (2) The remaining compounds were clustered together with our in-house collection using a novel fingerprint-based clustering algorithm that emphasizes common substructures and scales to millions of molecules. (3) Clusters populated exclusively by external compounds were identified as "diversity holes," and representative members of these clusters were presented to our global medicinal chemistry community, who used a simple point-and-click interface to indicate which ones they liked, disliked, or were indifferent to. (4) The resulting votes were used to rank the clusters from most to least desirable and to prioritize which ones should be targeted for acquisition. Analysis of the voting results reveals interesting voter behaviors and distinct preferences for certain molecular property ranges that are fully consistent with lead-like profiles established through systematic analysis of large historical databases.
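Steps (3) and (4) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the cluster assignments are taken as given (the fingerprint-based clustering of step 2 is not reproduced), and the function names, the toy data, and the scoring rule (mean vote, with like = +1, indifferent = 0, dislike = -1) are all assumptions made for illustration, since the abstract does not specify how votes were aggregated.

```python
from collections import defaultdict


def find_diversity_holes(cluster_of, is_external):
    """Return ids of clusters whose members are all external compounds.

    cluster_of:  dict mapping compound id -> cluster id (assumed precomputed)
    is_external: dict mapping compound id -> True if the compound is a vendor
                 candidate, False if it belongs to the in-house collection
    """
    members = defaultdict(list)
    for compound, cluster in cluster_of.items():
        members[cluster].append(compound)
    return sorted(
        cluster
        for cluster, compounds in members.items()
        if all(is_external[c] for c in compounds)
    )


def rank_clusters(votes, holes):
    """Rank diversity holes from most to least desirable.

    votes: iterable of (cluster id, vote) pairs, vote in {+1, 0, -1}.
    The mean-vote score is a hypothetical aggregation rule, not the paper's.
    """
    tally = {cluster: [0, 0] for cluster in holes}  # [vote sum, vote count]
    for cluster, vote in votes:
        if cluster in tally:  # ignore votes on clusters that are not holes
            tally[cluster][0] += vote
            tally[cluster][1] += 1

    def score(cluster):
        total, count = tally[cluster]
        return total / count if count else 0.0

    # Highest score first; ties broken by cluster id for determinism.
    return sorted(holes, key=lambda c: (-score(c), c))


# Toy data: cluster B contains an in-house compound, so it is not a hole.
cluster_of = {"ext1": "A", "ext2": "A", "in1": "B", "ext3": "B", "ext4": "C"}
is_external = {"ext1": True, "ext2": True, "in1": False, "ext3": True, "ext4": True}
votes = [("A", 1), ("A", -1), ("A", 0), ("C", 1), ("C", 1), ("B", 1)]

holes = find_diversity_holes(cluster_of, is_external)
ranking = rank_clusters(votes, holes)
print(holes, ranking)
```

In this toy example, clusters A and C are diversity holes, and C outranks A because its mean vote is higher. A real deployment would likely also weigh the number of votes per cluster and voter consistency, which this sketch ignores.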
