Statistical Confidence for Variable Selection in QSAR Models via Monte Carlo Cross-Validation

A new wrapper method for variable selection, the Monte Carlo variable selection (MCVS) method, was developed within the framework of Monte Carlo cross-validation (MCCV). MCVS reports its variable selection results as P-values, the most conventional measure in statistical hypothesis testing, which allows a clear and simple statistical interpretation of the results. The method applies equally to quantitative structure-activity relationship (QSAR) models based on multiple linear regression (MLR) and to non-MLR-based models. To demonstrate the approach, it was applied with MLR to the blood-brain barrier (BBB) permeation and human intestinal absorption (HIA) QSAR problems. Starting from more than 1600 molecular descriptors, only two (TPSA(NO) and ALOGP) yielded acceptably low P-values for the BBB and HIA problems, respectively. The new method has been implemented in the QSAR-BENCH v2 program, which is freely available for academic use (including its Java source code) from www.dmitrykonovalov.org.
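
For readers unfamiliar with the MCCV framework the method builds on, the following is a minimal sketch of generic Monte Carlo cross-validation for a one-descriptor linear model. It is not the MCVS procedure and does not compute P-values; the abstract does not specify those details. The function names `fit_line` and `mccv_rmse`, the 50% test fraction, the number of splits, and the use of RMSE as the error measure are all illustrative assumptions.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, mean_y - slope * mean_x


def mccv_rmse(xs, ys, n_splits=200, test_fraction=0.5, seed=0):
    """Estimate prediction RMSE by repeated random train/test splits.

    Hypothetical helper: each iteration shuffles the samples, fits the
    model on the training part and scores it on the held-out part.
    Requires at least three samples.
    """
    rng = random.Random(seed)
    n = len(ys)
    # Keep at least two training samples and one test sample.
    n_test = max(1, min(n - 2, round(test_fraction * n)))
    indices = list(range(n))
    total_sq = 0.0
    for _ in range(n_splits):
        rng.shuffle(indices)
        test, train = indices[:n_test], indices[n_test:]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        total_sq += sum((ys[i] - (slope * xs[i] + intercept)) ** 2
                        for i in test) / n_test
    return math.sqrt(total_sq / n_splits)
```

Under these assumptions, a descriptor that carries real signal should give a lower cross-validated error than an unrelated one, which is the kind of comparison a wrapper method can then turn into a selection criterion.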
