Monte Carlo method for identification of outlier molecules in QSAR studies

This paper discusses difficulties that arise when the classical formula is applied to identify “outliers” in a given set of objects, and proposes a new Monte Carlo-like method for identifying outliers in the calibration set used in QSPR/QSAR computations. Sub-sets of molecules are randomly extracted thousands of times from the calibration set. The method relies on the idea that the presence of outlier molecules in a sub-set decreases the predictive power of the QSAR equation built from that sub-set: sub-sets containing outliers often yield poor-quality QSAR equations and rarely yield high-quality ones. A specific formula for an “outlier index” is proposed. The molecule with the highest outlier index is eliminated from the calibration set, and the identification/elimination process is repeated until the maximum value of the outlier index stops decreasing. Five examples of outlier identification on various kinds of calibration sets are presented. The results are compared with those obtained with a classical outlier index formula, using the same calibration set, the same set of descriptors, and the same identification/elimination procedure.
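The procedure described above can be illustrated with a minimal Python sketch. The abstract does not give the paper's outlier index formula, so the index used here is a hypothetical stand-in: for each molecule, the number of times it appears in the poorest-quality sub-sets minus the number of times it appears in the best-quality sub-sets, divided by its total number of appearances. The QSAR equation is assumed to be an ordinary least-squares linear model, and equation quality is approximated by the r² of the fit on the sub-set. The function names and the parameters `n_draws`, `subset_frac`, `tail`, and `max_removed` are likewise assumptions made for illustration.

```python
import numpy as np


def outlier_indices(X, y, n_draws=2000, subset_frac=0.7, tail=0.1, seed=0):
    """Hypothetical outlier index per molecule (not the paper's formula)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    k = int(subset_frac * n)                      # molecules per random sub-set
    A = np.column_stack([np.ones(n), X])          # descriptors plus intercept
    member = np.zeros((n_draws, n), dtype=bool)   # which molecules each draw used
    r2 = np.empty(n_draws)
    for d in range(n_draws):
        idx = rng.choice(n, size=k, replace=False)
        member[d, idx] = True
        # Assumed QSAR equation: ordinary least squares on the sub-set.
        coef, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
        resid = y[idx] - A[idx] @ coef
        r2[d] = 1.0 - resid @ resid / np.sum((y[idx] - y[idx].mean()) ** 2)
    order = np.argsort(r2)                        # ascending: poorest fits first
    m = max(1, int(tail * n_draws))
    poor, good = order[:m], order[-m:]
    appearances = np.maximum(member.sum(axis=0), 1)
    # Often in poor equations and rarely in good ones gives a high index.
    return (member[poor].sum(axis=0) - member[good].sum(axis=0)) / appearances


def eliminate_outliers(X, y, max_removed=5, **kwargs):
    """Remove the top-index molecule until the maximum index stops decreasing."""
    active = np.arange(len(y))                    # original molecule positions
    removed, prev_max = [], np.inf
    while len(removed) < max_removed:
        scores = outlier_indices(X[active], y[active], **kwargs)
        worst = int(np.argmax(scores))
        if scores[worst] >= prev_max:             # maximum no longer decreasing
            break
        prev_max = scores[worst]
        removed.append(int(active[worst]))
        active = np.delete(active, worst)
    return removed
```

On a synthetic calibration set in which one activity value has been shifted far from the linear trend, `outlier_indices` would be expected to give that molecule the largest index, and `eliminate_outliers` would remove it first. The `max_removed` cap is a safeguard added in this sketch and is not part of the procedure described in the abstract.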
