The better predictive model: High q2 for the training set or low root mean square error of prediction for the test set?

Validation of computational models (e.g., QSARs) may be the most important step in their development. Different requirements for the reliability and predictivity of QSAR models have been described in the literature. Despite these formal recommendations, there are few practical rules on when to stop adding variables to a QSAR (i.e., what level of model complexity is appropriate). In this work, the influence of model complexity on statistical fit and error was investigated using toxicity data for 200 phenols to the ciliated protozoan Tetrahymena pyriformis, with a test set of a further 50 compounds. The results showed that several factors are important in defining a good and reliable QSAR: q2 is not a good criterion for a model's predictivity; outliers should not necessarily be deleted, as this may reduce the chemical space of the model; the number of descriptors in a multivariate model should be chosen carefully to avoid model under- and over-estimation; and an appropriate number of dimensions is required for PLS modelling.
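The contrast between the two statistics in the title can be illustrated with a minimal sketch. It assumes ordinary least-squares regression, a leave-one-out cross-validated q2 computed on the training set, and a root mean square error of prediction (RMSEP) computed on a held-out test set. The function names, the synthetic descriptors, and the noise level are all hypothetical and chosen only for illustration; they are not the phenol data or the modelling procedure used in this work.

```python
import numpy as np


def _design(X):
    """Return the descriptor matrix with an intercept column prepended."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(X.shape[0]), X])


def loo_q2(X, y):
    """Leave-one-out cross-validated q2 for an OLS model.

    Uses the hat-matrix identity e_loo_i = e_i / (1 - h_ii), so no model
    is refitted. q2 = 1 - PRESS / SS_tot, with SS_tot taken about the
    mean of the training responses (one common convention).
    """
    A = _design(X)
    y = np.asarray(y, dtype=float)
    H = A @ np.linalg.pinv(A)          # hat (projection) matrix
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - np.diag(H))
    press = np.sum(loo_residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - press / ss_tot)


def rmsep(X_train, y_train, X_test, y_test):
    """Root mean square error of prediction on an external test set."""
    coef, *_ = np.linalg.lstsq(_design(X_train),
                               np.asarray(y_train, dtype=float), rcond=None)
    predicted = _design(X_test) @ coef
    errors = np.asarray(y_test, dtype=float) - predicted
    return float(np.sqrt(np.mean(errors ** 2)))


def demo(seed=0):
    """Track q2 and RMSEP as descriptors are added (synthetic data only).

    Assumption: 250 hypothetical compounds with 10 random descriptors, of
    which only the first two influence the response. The 200/50 split
    mirrors the training and test set sizes mentioned in the abstract.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(250, 10))
    y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.5, size=250)
    X_train, y_train = X[:200], y[:200]
    X_test, y_test = X[200:], y[200:]
    results = []
    for k in range(1, X.shape[1] + 1):
        results.append((k,
                        loo_q2(X_train[:, :k], y_train),
                        rmsep(X_train[:, :k], y_train, X_test[:, :k], y_test)))
    return results
```

Calling `demo()` returns one `(number_of_descriptors, q2, RMSEP)` tuple per model size, so the two criteria can be compared side by side as complexity grows. Because q2 is computed only from the training compounds, it need not track the external RMSEP, which is the distinction the abstract draws.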
