On the selection of the training set in environmental QSAR analysis when compounds are clustered

In environmental QSAR analysis, adverse effects of chemicals released to the environment are modelled and predicted as a function of the chemical properties of the pollutants. The set of compounds under study usually contains several classes of substances, i.e. it is more or less strongly clustered. It is then necessary to ensure that the selected training set comprises compounds representing all of these chemical classes. Multivariate design in the principal properties of the compound classes is usually appropriate for selecting a meaningful training set. However, with clustered data, which are often seen in environmental chemistry and toxicology, a single multivariate design may be suboptimal because it risks ignoring small classes with few members and selecting training set compounds only from the largest classes. We recently proposed a procedure for training set selection that recognizes clustering. In this approach, when non‐selective biological or environmental responses are modelled, local multivariate designs are constructed within each cluster (class). The compounds chosen by the local designs are then united in the overall training set, which thus contains members from all clusters. Here the proposed strategy is further tested and elaborated by applying it to a series of 351 chemical substances for which the soil sorption coefficient is available. These compounds are divided into 14 classes containing between 10 and 52 members each. The training set selection is discussed, followed by multivariate QSAR modelling, model interpretation and predictions for the test set. Various types of statistical experimental designs are tested during the training set selection phase. Copyright © 2000 John Wiley & Sons, Ltd.
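The cluster-wise selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the principal properties of each class are taken as the first two principal component scores (computed by SVD on autoscaled descriptors), the local design is a two-level full factorial plus a centre point, and each design point is matched to the nearest compound not yet chosen. The abstract does not state which designs were used. The function names `local_design` and `select_training_set` are hypothetical.

```python
import itertools
import numpy as np

def local_design(X, n_components=2):
    """Pick rows of X nearest to the points of a two-level factorial
    design (plus centre point) laid out in the local PC score space."""
    # Autoscale descriptors within the cluster; guard constant columns.
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0
    Xc = (X - X.mean(axis=0)) / std
    # PCA via SVD: scores T = U * s for the leading components.
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]
    # Rescale each score vector so the design levels -1/+1 span its range.
    T = T / np.abs(T).max(axis=0)
    # Design points: all corners of the hypercube, then the centre.
    targets = [np.array(c) for c in
               itertools.product([-1.0, 1.0], repeat=n_components)]
    targets.append(np.zeros(n_components))
    chosen = []
    for t in targets:
        # Nearest compound to this design point that is not yet selected.
        for i in np.argsort(np.linalg.norm(T - t, axis=1)):
            if int(i) not in chosen:
                chosen.append(int(i))
                break
    return chosen

def select_training_set(X, labels, n_components=2):
    """Union of the local designs built separately within each cluster."""
    labels = np.asarray(labels)
    training = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        picks = local_design(X[members], n_components)
        training.extend(int(members[i]) for i in picks)
    return sorted(training)

# Synthetic stand-in data (not the soil sorption data set): three
# classes of 10, 20 and 52 compounds described by four descriptors.
rng = np.random.default_rng(0)
sizes = [10, 20, 52]
X = np.vstack([rng.normal(loc=3 * k, size=(n, 4))
               for k, n in enumerate(sizes)])
labels = np.repeat(np.arange(3), sizes)
train = select_training_set(X, labels)
```

Because each class receives its own design, the smallest class contributes as many training compounds as the largest one, which is the property a single global design would not guarantee.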
