Automated Synthetic Feasibility Assessment: A Data-driven Derivation of Computational Tools for Medicinal Chemistry

Abraham Heifets
Doctor of Philosophy, Graduate Department of Computer Science, University of Toronto, 2014

The planning of organic syntheses, a critical problem in chemistry, can be directly modeled as resource-constrained branching plans in a discrete, fully-observable state space. Despite this clear relationship, the full artillery of artificial intelligence has not been brought to bear on this problem due to its inherent complexity and multidisciplinary challenges. In this thesis, I describe a mapping between organic synthesis and heuristic search and build a planner that can solve such problems automatically at the undergraduate level. Along the way, I show the need for powerful heuristic search algorithms and build large databases of synthetic information, which I use to derive a qualitatively new kind of heuristic guidance.
