The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching

Background: The Chemistry Development Kit (CDK) is a widely used open source cheminformatics toolkit, providing data structures to represent chemical concepts along with methods to manipulate such structures and perform computations on them. The library implements a wide variety of cheminformatics algorithms ranging from chemical structure canonicalization to molecular descriptor calculations and pharmacophore perception. It is used in drug discovery, metabolomics, and toxicology. Over the last 10 years, however, the code base has grown significantly, resulting in many complex interdependencies among components and poor performance of many algorithms.

Results: We report improvements made in CDK v2.0 since the v1.2 release series, specifically addressing the increased functional complexity and poor performance. We first summarize the addition of new functionality, such as atom typing and molecular formula handling, and improvements to existing functionality that have led to significantly better performance for substructure searching, molecular fingerprints, and rendering of molecules. Second, we outline how the CDK has evolved with respect to quality control and the approaches we have adopted to ensure stability, including a code review mechanism.

Conclusions: This paper highlights our continued efforts to provide a community driven, open source cheminformatics library, and shows that such collaborative projects can thrive over extended periods of time, resulting in a high-quality and performant library. By taking advantage of community support and contributions, we show that an open source cheminformatics project can act as a peer reviewed publishing platform for scientific computing software.

Graphical abstract: CDK 2.0 provides new features and improved performance.
