The rcdk and cluster R packages applied to drug candidate selection

The aim of this article is to show how the power of statistics and cheminformatics can be combined in R using two packages: rcdk and cluster. We describe the role of clustering methods in identifying similar structures in a group of 23 molecules according to their fingerprints. The most commonly used method is to group the molecules using a "score" obtained by measuring the average distance between them. This score reflects the similarity or dissimilarity between compounds and helps us identify active or potentially toxic substances through predictive studies. Clustering is the process by which the common characteristics of a particular class of compounds are identified. For clustering applications, molecular fingerprint similarity is generally measured with the Tanimoto coefficient. Based on the molecular fingerprints, we calculated the molecular distances between the methotrexate molecule and the other 23 molecules in the group, and organized them into a matrix. Using these molecular distances and Ward's method, the molecules were grouped into 3 clusters. We can presume that compounds located close together in the cluster map are structurally similar. Because only 5 molecules were included in the methotrexate cluster, we considered that they might have similar properties and might be further tested as potential drug candidates.
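To make the distance calculation concrete, here is a minimal sketch of how a Tanimoto distance matrix can be built from binary fingerprints. It is not the authors' R workflow: it is written in plain Python for illustration, the fingerprints are hypothetical toy bit sets rather than real rcdk output, and the names `tanimoto`, `distance_matrix`, and `mol_A` to `mol_C` are assumptions introduced here. The distance is taken as one minus the Tanimoto coefficient, which is a common convention but is assumed rather than stated in the abstract.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def distance_matrix(fps):
    """Symmetric matrix of Tanimoto distances (1 - similarity)."""
    n = len(fps)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = 1.0 - tanimoto(fps[i], fps[j])
    return d


# Hypothetical toy fingerprints: each set holds the indices of the bits set to 1.
fingerprints = {
    "mol_A": frozenset({1, 2, 3, 4}),
    "mol_B": frozenset({1, 2, 3, 5}),
    "mol_C": frozenset({7, 8, 9}),
}

names = list(fingerprints)
dist = distance_matrix([fingerprints[name] for name in names])

for name, row in zip(names, dist):
    print(name, [round(x, 3) for x in row])
```

In this toy set, `mol_A` and `mol_B` share three of five distinct bits, so their distance is small, while `mol_C` shares no bits with either and sits at the maximum distance of 1. A matrix of this shape is what a hierarchical method such as Ward's would take as input when grouping the molecules into clusters.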
