Statistical external validation and consensus modeling: a QSPR case study for Koc prediction.

The soil sorption partition coefficient (log K(oc)) of a heterogeneous set of 643 non-ionic organic compounds, spanning a range of more than 6 log units, is predicted by a statistically validated QSAR modeling approach. The models are multiple linear regressions fitted by ordinary least squares (OLS) on a variety of theoretical molecular descriptors, selected by the genetic algorithm-variable subset selection (GA-VSS) procedure. Predictivity was assessed by several internal and external validation approaches. For external validation, self-organizing maps (SOM) were used to split the original data set: the best four-descriptor model, developed on a reduced training set of 93 chemicals, has an external predictivity of 78% when applied to the remaining 550 chemicals (the prediction set). The selected molecular descriptors, which can be interpreted through their mechanistic meaning, were compared with the more common physico-chemical descriptors log K(ow) and log S(w). The chemical applicability domain of each model was verified by the leverage approach, so that only reliable predictions are proposed. The best predictions were obtained by consensus modeling over 10 different models from the genetic algorithm model population.
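The workflow summarized above (OLS models on descriptor subsets, a leverage check of the applicability domain, and consensus prediction) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the data are synthetic random numbers that only mirror the 93/550 split sizes, the three descriptor subsets are hypothetical stand-ins for models picked by GA-VSS, the consensus is taken as a plain average of the individual predictions, the leverage cutoff is the commonly used h* = 3(p + 1)/n, and external predictivity is computed as Q²ext = 1 − PRESS/SS with SS referred to the training-set mean.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones so the OLS model has an intercept."""
    return np.column_stack([np.ones(len(X)), X])

def fit_ols(X, y):
    """Ordinary least squares coefficients (intercept first)."""
    coef, *_ = np.linalg.lstsq(add_intercept(X), y, rcond=None)
    return coef

def predict(coef, X):
    return add_intercept(X) @ coef

def leverages(X_train, X_query):
    """Leverage h_i = x_i^T (X^T X)^-1 x_i of each query row w.r.t. the training set."""
    A = add_intercept(X_train)
    Q = add_intercept(X_query)
    inv = np.linalg.pinv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, inv, Q)

def q2_ext(y_train, y_test, y_pred):
    """External explained variance, referred to the training-set mean."""
    press = np.sum((y_test - y_pred) ** 2)
    ss = np.sum((y_test - y_train.mean()) ** 2)
    return 1.0 - press / ss

# Synthetic stand-in data (NOT the paper's data set or descriptors).
rng = np.random.default_rng(0)
n_train, n_test, n_desc = 93, 550, 8
X_train = rng.normal(size=(n_train, n_desc))
X_test = rng.normal(size=(n_test, n_desc))
true_coef = rng.normal(size=n_desc)
y_train = X_train @ true_coef + rng.normal(scale=0.3, size=n_train)
y_test = X_test @ true_coef + rng.normal(scale=0.3, size=n_test)

# Hypothetical four-descriptor models, standing in for a GA-selected population.
subsets = [[0, 1, 2, 3], [1, 2, 4, 5], [0, 3, 6, 7]]

preds, inside = [], []
for cols in subsets:
    coef = fit_ols(X_train[:, cols], y_train)
    preds.append(predict(coef, X_test[:, cols]))
    h_star = 3 * (len(cols) + 1) / n_train  # assumed warning leverage
    inside.append(leverages(X_train[:, cols], X_test[:, cols]) <= h_star)

consensus = np.mean(preds, axis=0)   # assumed: unweighted average of the models
reliable = np.all(inside, axis=0)    # inside the domain of every model

print("Q2ext, consensus:", q2_ext(y_train, y_test, consensus))
print("compounds inside all domains:", int(reliable.sum()), "of", n_test)
```

Requiring a compound to fall inside the domain of every model is the strictest choice; a looser rule, such as averaging only the models whose domain contains the compound, would be equally easy to write and may suit other uses better.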
