Modeling a Crowdsourced Definition of Molecular Complexity

This paper brings together the concepts of molecular complexity and crowdsourcing. In an exercise at Merck, 386 chemists voted on the molecular complexity (on a scale of 1-5) of 2681 molecules taken from public, licensed, and in-house sources. The meanComplexity of a molecule is the average over all votes for that molecule. As long as enough votes are cast per molecule, we find that meanComplexity is quite easy to model with QSAR methods using only a handful of physical descriptors (e.g., number of chiral centers, number of unique topological torsions, a Wiener index). The high level of self-consistency of the model (cross-validated R² ≈ 0.88) is remarkable given that our chemists do not agree strongly with each other about the complexity of any given molecule. This demonstrates the power of crowdsourcing in this case. The meanComplexity appears to be correlated with at least one literature metric of synthetic complexity that was derived in a different way, and it is correlated with values of process mass intensity (PMI) from the literature and from in-house studies. Complexity can be used to differentiate between in-house programs and to follow a program over time.
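As a minimal sketch of the definition above, meanComplexity can be computed by grouping votes per molecule and averaging them. The function name `mean_complexity`, the `(molecule_id, score)` input layout, the molecule identifiers, and the `min_votes` cutoff are assumptions made for illustration; the abstract states only that enough votes must be cast per molecule, not what the threshold is.

```python
from collections import defaultdict
from statistics import mean


def mean_complexity(votes, min_votes=1):
    """Average the 1-5 complexity votes for each molecule.

    votes: iterable of (molecule_id, score) pairs (hypothetical layout).
    min_votes: hypothetical cutoff; molecules with fewer votes are dropped.
    """
    scores_by_molecule = defaultdict(list)
    for molecule_id, score in votes:
        if not 1 <= score <= 5:
            raise ValueError(f"score {score} is outside the 1-5 scale")
        scores_by_molecule[molecule_id].append(score)

    # Keep only molecules that received enough votes to average.
    return {
        molecule_id: mean(scores)
        for molecule_id, scores in scores_by_molecule.items()
        if len(scores) >= min_votes
    }


# Made-up votes for illustration only; not data from the study.
votes = [
    ("mol_A", 2), ("mol_A", 3), ("mol_A", 4),
    ("mol_B", 5), ("mol_B", 4),
    ("mol_C", 1),
]
result = mean_complexity(votes, min_votes=2)
```

With these illustrative votes, `mol_C` is dropped because it has a single vote, while `mol_A` and `mol_B` keep their per-molecule averages. The resulting values would be the target variable for a QSAR model built on descriptors such as those named in the abstract.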
