Data Visualization, Regression, Applicability Domains and Inverse Analysis Based on Generative Topographic Mapping

This paper introduces two methods based on generative topographic mapping (GTM) that can be used for data visualization, regression analysis, inverse analysis, and the determination of applicability domains (ADs). In GTM-multiple linear regression (GTM-MLR), the prior probability distribution of the descriptors or explanatory variables (X) is calculated with GTM, and the posterior probability distribution of the property/activity or objective variable (y) given X is calculated with MLR; inverse analysis is then performed using the product rule and Bayes' theorem. In GTM-regression (GTMR), X and y are combined and GTM is applied to the combined variables to obtain the joint probability distribution of X and y; from this joint distribution, the posterior probability distributions of y given X and of X given y are derived and used for regression and inverse analysis, respectively. Simulations using linear and nonlinear datasets, together with quantitative structure-activity relationship (QSAR) and quantitative structure-property relationship (QSPR) datasets, confirm that GTM-MLR and GTMR enable data visualization, regression analysis, and inverse analysis while taking appropriate ADs into account. Python and MATLAB codes for the proposed algorithms are available at https://github.com/hkaneko1985/gtm-generativetopographicmapping.
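The GTMR idea described above, conditioning a joint mixture model of X and y in either direction, can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the joint density is a mixture of isotropic Gaussians with equal mixing weights and a shared inverse variance `beta`, as in standard GTM, and that the component means have already been fitted. The function names (`responsibilities`, `predict_y`, `inverse_x`) and the toy centers are hypothetical and chosen only for illustration.

```python
import numpy as np

def responsibilities(value, centers, beta):
    """Posterior weight of each mixture component given one observed block.

    value:   (d,) observed vector (either x or y)
    centers: (K, d) component means restricted to that block
    beta:    shared inverse variance of the isotropic components (assumed)
    """
    sq_dist = np.sum((centers - value) ** 2, axis=1)
    log_w = -0.5 * beta * sq_dist   # equal mixing weights cancel out
    log_w -= log_w.max()            # stabilise the exponentials
    w = np.exp(log_w)
    return w / w.sum()

def predict_y(x, centers_x, centers_y, beta):
    """Regression: mean of p(y | x) under the joint mixture."""
    w = responsibilities(np.asarray(x, dtype=float), centers_x, beta)
    return w @ centers_y

def inverse_x(y, centers_x, centers_y, beta):
    """Inverse analysis: component weights and mean of p(x | y)."""
    w = responsibilities(np.atleast_1d(np.asarray(y, dtype=float)), centers_y, beta)
    return w, w @ centers_x

# Toy, hand-made component means (hypothetical, not fitted by GTM).
grid = np.linspace(-1.0, 1.0, 5)
centers_x = np.column_stack([grid, grid ** 2])  # (5, 2) means in X space
centers_y = (2.0 * grid)[:, None]               # (5, 1) means in y space
beta = 50.0

y_hat = predict_y(centers_x[3], centers_x, centers_y, beta)
weights, x_mean = inverse_x(1.0, centers_x, centers_y, beta)
```

Because the components are isotropic in this sketch, conditioning on one block changes only the component weights and leaves the means of the other block unchanged. The weights returned by `inverse_x` describe a mixture over X rather than a single point, so a multimodal p(X | y) would be better summarized by its individual components than by the overall mean.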
