Determining the Validity of a QSAR Model - A Classification Approach

Determining whether a QSAR model remains valid when applied to new compounds is an important concern in QSAR and QSPR modeling. Various scoring techniques can be applied to specific types of models. We present a technique for stating whether a new compound will be well predicted by a previously built QSAR model. This study focuses on linear regression models, though the technique is general and could also be applied to other types of quantitative models. The technique divides the regression residuals from a previously generated model into a good class and a bad class and then builds a classifier on this division. The trained classifier is then used to determine the residual class of a new compound. We investigated the performance of a variety of classifiers, both linear and nonlinear. The technique was tested on two data sets from the literature and one hand-built data set. The selected data sets covered both physical and biological properties and presented the methodology with quantitative regression models of varying quality. The results indicate that the technique can determine whether a new compound will be well or poorly predicted, with weighted success rates ranging from 73% to 94% for the best classifier.
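
The residual-classification idea can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the study: the data are synthetic, the regression uses a single descriptor, the cutoff of one standard deviation for separating good from bad residuals is hypothetical, and a k-nearest-neighbor vote stands in for whichever linear and nonlinear classifiers the authors evaluated.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (one descriptor, for brevity)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def label_residuals(residuals, cutoff=1.0):
    """Label residuals 'good' or 'bad' by standardized magnitude.

    The cutoff is a hypothetical choice, not a value from the paper.
    """
    sd = statistics.pstdev(residuals)
    return ["good" if abs(r) / sd <= cutoff else "bad" for r in residuals]

def knn_predict(train_x, train_labels, x_new, k=3):
    """Majority vote among the k training compounds closest to x_new."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x_new))
    votes = [train_labels[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

# Synthetic training set: the linear trend holds closely for x < 14
# and poorly for x >= 14, so residual quality depends on the descriptor.
xs = [float(x) for x in range(20)]
ys = [2.0 * x + 1.0 + (0.1 if x < 14 else 3.0) * (-1) ** int(x) for x in xs]

# Step 1: build the regression model and collect its residuals.
a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Step 2: divide the residuals into a good class and a bad class.
labels = label_residuals(residuals)

# Step 3: classify new compounds from their descriptor value alone.
for x_new in (2.5, 17.5):
    print(x_new, knn_predict(xs, labels, x_new))
```

The classifier never sees the observed property of the new compound. It uses only descriptor values, which is what allows it to flag a likely poor prediction before any measurement exists.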
