Local Lazy Regression: Making Use of the Neighborhood to Improve QSAR Predictions

Traditional quantitative structure-activity relationship (QSAR) models aim to capture global structure-activity trends present in a data set. In many situations, however, groups of molecules share a specific set of features that relates to their activity or inactivity. Such a set of features can be said to represent a local structure-activity relationship, which traditional QSAR models may fail to recognize. In this work, we investigate the use of local lazy regression (LLR), which obtains a prediction for a query molecule using its local neighborhood rather than the whole data set. This modeling approach is especially useful for very large data sets because no a priori model need be built. We applied the technique to three biological data sets. In the first case, the root-mean-square error (RMSE) for an external prediction set was 0.94 log units for LLR versus 0.92 log units for the global model. However, LLR characterized a specific group of anomalous molecules with better accuracy (RMSE of 0.64 log units versus 0.70 log units for the global model). For the second data set, LLR reduced the RMSE for the external prediction set from 0.36 log units to 0.31 log units. In the third case, we obtained an RMSE of 2.01 log units versus 2.16 log units for the global model. In all cases, LLR predicted a few observations poorly compared to the global model. We present an analysis of why this was observed and discuss possible improvements to the local regression approach.
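The core idea of LLR, predicting each query from a model fitted only to its neighborhood, can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: Euclidean distance in descriptor space, a fixed neighborhood size `k`, and an ordinary least-squares linear model fitted to the neighbors. The function names `llr_predict` and `rmse` are hypothetical and chosen only for illustration.

```python
import numpy as np

def llr_predict(X_train, y_train, x_query, k=10):
    """Predict one query by fitting a linear model to its k nearest neighbors.

    Assumptions (not from the source): Euclidean distance, fixed k,
    ordinary least squares with an intercept.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)

    # Distance from the query to every training point in descriptor space.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points (the local neighborhood).
    idx = np.argsort(dists)[:k]

    # Fit a linear model with intercept on the neighborhood only.
    A = np.column_stack([np.ones(len(idx)), X_train[idx]])
    coef, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)

    # Evaluate the local model at the query point.
    return float(coef[0] + x_query @ coef[1:])

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

if __name__ == "__main__":
    # Toy data with two local linear trends that a single global line cannot fit.
    X = np.linspace(-3, 3, 61).reshape(-1, 1)
    y = np.abs(X[:, 0])
    print(llr_predict(X, y, [2.0], k=10))
```

Because the model is fitted at query time, nothing is built in advance, which is the "lazy" property the abstract refers to. The cost is one neighbor search and one small regression per query, and a poorly chosen or sparsely populated neighborhood can yield an unstable local fit.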
