A new strategy of outlier detection for QSAR/QSPR

A crucial step in building a high-performance QSAR/QSPR model is the detection of outliers. Detecting outliers in a multivariate point cloud is not trivial, especially when several outliers coexist. Classical identification methods do not always find them, because they rely on the sample mean and covariance matrix, which are themselves influenced by the outliers. Moreover, existing methods tend to emphasize one type of outlier rather than all types. To avoid these problems and detect all kinds of outliers simultaneously, we propose a new strategy based on Monte‐Carlo cross‐validation, termed the MC method. By establishing many cross‐predictive models, the MC method inherently provides a feasible way to detect different kinds of outliers. The distribution of predictive residuals obtained in this way appears to reduce the risk caused by the masking effect. In addition, a new display is proposed in which the absolute value of the mean of the predictive residuals is plotted against the standard deviation of the predictive residuals. The plot divides the data into normal samples, y-direction outliers, and X-direction outliers. Several examples demonstrate the detection ability of the MC method in comparison with other diagnostic methods. © 2009 Wiley Periodicals, Inc. J Comput Chem, 2010
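The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the regression model, so ordinary least squares is assumed here as a stand-in, and the function name `mc_residual_stats` and its parameters (`n_iter`, `train_frac`, `seed`) are hypothetical.

```python
import numpy as np

def mc_residual_stats(X, y, n_iter=500, train_frac=0.7, seed=0):
    """Monte Carlo cross-validation residual statistics per sample.

    Assumption: ordinary least squares is used as the base model;
    the abstract does not specify which model the MC method uses.
    Returns (|mean residual|, std of residuals) for each sample,
    i.e. the two axes of the display described in the abstract.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_train = int(round(train_frac * n))
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept

    resid_sum = np.zeros(n)   # running sum of predictive residuals
    resid_sq = np.zeros(n)    # running sum of squared residuals
    counts = np.zeros(n)      # how often each sample was held out

    for _ in range(n_iter):
        # Random split: the first n_train indices train, the rest are predicted.
        perm = rng.permutation(n)
        train, test = perm[:n_train], perm[n_train:]
        coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
        r = y[test] - A[test] @ coef
        resid_sum[test] += r
        resid_sq[test] += r ** 2
        counts[test] += 1

    counts = np.maximum(counts, 1)  # guard against samples never held out
    mean = resid_sum / counts
    var = np.maximum(resid_sq / counts - mean ** 2, 0.0)
    return np.abs(mean), np.sqrt(var)
```

Plotting the first returned array against the second gives a display of the kind the abstract describes. In this sketch, a sample whose response value is grossly wrong would be expected to show a large absolute mean residual, since models trained without it predict it poorly in every split; how the regions of the plot are delimited is not stated in the abstract and is left open here.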
