Defining a novel k-nearest neighbours approach to assess the applicability domain of a QSAR model for reliable predictions

Background
With the growing use of QSAR predictions for regulatory purposes, such predictive models must now be strictly validated, and an essential part of that validation is a clearly defined Applicability Domain (AD) for the model. Although several different approaches have been proposed in recent years to address this goal, no optimal approach to defining a model’s AD has yet been recognized.

Results
This study proposes a novel descriptor-based AD method that accounts for the data distribution and exploits the k-Nearest Neighbours (kNN) principle to derive a heuristic decision rule. The proposed method is a three-stage procedure that addresses several key aspects relevant to judging the reliability of QSAR predictions. Inspired by the adaptive kernel method for probability density function estimation, the first stage defines a pattern of thresholds corresponding to the various training samples; these thresholds are later used to derive the decision rule. The second stage defines the criterion that decides whether a given test sample is retained within the AD. Finally, the last stage reflects on the reliability of the derived results, taking model statistics and prediction error into account.

Conclusions
The proposed approach introduces a novel strategy that integrates the kNN principle to define the AD of QSAR models. Relevant features that characterize the proposed AD approach include: a) adaptability to the local density of samples, useful when the underlying multivariate distribution is asymmetric, with wide regions of low data density; b) unlike several kernel density estimators (KDE), effectiveness also in high-dimensional spaces; c) low sensitivity to the smoothing parameter k; and d) versatility to implement various distance measures. The results obtained on a case study provide a clear understanding of how the approach works and how it defines the model’s AD for reliable predictions.
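The first two stages described above can be illustrated with a minimal Python sketch. The abstract does not give the exact threshold formula, so this sketch makes simplifying assumptions: each training sample's threshold is taken to be its mean Euclidean distance to its k nearest training neighbours, and a test sample is kept inside the AD if it lies within the threshold of at least one training sample. The function names (`pairwise_distances`, `training_thresholds`, `in_domain`) are hypothetical, and the third stage (reliability assessment from model statistics and prediction error) is not covered.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def training_thresholds(x_train, k):
    """Stage 1 (assumed form): one threshold per training sample.

    The threshold is the mean distance to the k nearest training
    neighbours, so it is small in dense regions and large in sparse ones.
    """
    d = pairwise_distances(x_train, x_train)
    np.fill_diagonal(d, np.inf)          # a sample is not its own neighbour
    nearest = np.sort(d, axis=1)[:, :k]  # k smallest distances per row
    return nearest.mean(axis=1)


def in_domain(x_test, x_train, thresholds):
    """Stage 2 (assumed rule): inside the AD if the test sample falls
    within the threshold of at least one training sample."""
    d = pairwise_distances(x_test, x_train)
    return (d <= thresholds[None, :]).any(axis=1)


# Toy data: four training points on a unit square, two test points.
x_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
x_test = np.array([[0.5, 0.5], [10.0, 10.0]])

thresholds = training_thresholds(x_train, k=2)
inside = in_domain(x_test, x_train, thresholds)
print(inside)  # the first test point is inside the AD, the second is not
```

Because the thresholds are attached to individual training samples rather than to the data set as a whole, the acceptance region tightens in dense regions and widens in sparse ones, which is the local-density adaptability listed under feature a). Swapping `pairwise_distances` for another metric would correspond to feature d).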
