GTM‐Based QSAR Models and Their Applicability Domains

In this paper we demonstrate that Generative Topographic Mapping (GTM), a machine learning method traditionally used for data visualisation, can be efficiently applied to QSAR modelling using probability distribution functions (PDF) computed in the 2‐dimensional latent space. Three scenarios of activity assessment were considered: (i) the “activity landscape” approach based on direct use of the PDF, (ii) QSAR models built on GTM‐generated descriptors derived from the PDF, and (iii) the k‐Nearest Neighbours (k‐NN) approach in the 2D latent space. Benchmarking calculations were performed on five datasets: stability constants of complexes of the metal cations Ca2+, Gd3+ and Lu3+ with organic ligands in water, aqueous solubility, and activity of thrombin inhibitors. The performance of GTM‐based regression models is similar to that obtained with some popular machine‐learning methods (random forest, k‐NN, M5P regression tree and PLS) applied to ISIDA fragment descriptors. By comparing GTM activity landscapes built on predicted and on experimental activities, one can visually assess a model’s performance and identify the areas of chemical space corresponding to reliable predictions. The applicability domain used in this work is based on data likelihood. Its application significantly improved model performance for 4 out of 5 datasets.
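As an illustration of scenario (i) and of the likelihood-based applicability domain, the sketch below shows one plausible way to carry out the arithmetic. It is not the authors' implementation. It assumes that a GTM has already been fitted elsewhere and that it supplies, for every molecule, a vector of responsibilities over the latent-grid nodes (the PDF) and a log-likelihood. The function names, the responsibility-weighted averaging and the percentile cut-off are all assumptions made for this example.

```python
import numpy as np

def node_activities(resp_train, y_train):
    """Responsibility-weighted mean activity at each latent-grid node.

    resp_train: (N, K) responsibilities of N training molecules over K nodes.
    y_train:    (N,)   experimental activities.
    """
    weights = resp_train.sum(axis=0)            # total responsibility per node
    safe = np.where(weights > 0, weights, 1.0)  # avoid division by zero on empty nodes
    return (resp_train.T @ y_train) / safe      # empty nodes get activity 0

def predict_activity(resp_query, landscape):
    """Predicted activity: landscape values averaged with the query's responsibilities."""
    return resp_query @ landscape

def in_domain(loglik_query, loglik_train, percentile=5.0):
    """Hypothetical AD rule: keep queries whose log-likelihood is at least
    the given percentile of the training-set log-likelihoods."""
    threshold = np.percentile(loglik_train, percentile)
    return loglik_query >= threshold

# Toy data: three training molecules on a two-node map.
resp_train = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [0.5, 0.5]])
y_train = np.array([2.0, 4.0, 3.0])
landscape = node_activities(resp_train, y_train)

# A query molecule split evenly between the two nodes.
y_pred = predict_activity(np.array([0.5, 0.5]), landscape)

# Assumed log-likelihoods; the cut-off percentile is a tunable choice.
loglik_train = np.array([-10.0, -12.0, -11.0, -30.0])
mask = in_domain(np.array([-20.0, -11.0]), loglik_train, percentile=25.0)
```

In this toy case the query lying halfway between the two nodes receives the average of the two node activities, and the query with the lower log-likelihood falls outside the domain. In practice the cut-off would be tuned, for example by cross-validation.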
