Ensemble partial least squares regression for descriptor selection, outlier detection, applicability domain assessment, and ensemble modeling in QSAR/QSPR modeling

In QSAR/QSPR modeling, building an accurate partial least squares (PLS) model usually involves handling descriptor selection, outlier detection, applicability domain assessment, nonlinear relationships, and model stability. In the present study, we present an ensemble PLS (EnPLS) method that addresses these modeling tasks within a unified methodological framework. EnPLS aims to provide a consistent algorithmic framework built on ensemble learning and statistical distributions. The approach exploits the fact that the distribution of PLS model coefficients provides a mechanism for ranking and interpreting the effects of variables, whereas the distribution of prediction errors provides a mechanism for differentiating outliers from normal samples and for assessing the applicability domain of models. Statistics of these distributions, namely the mean or median and the standard deviation, provide a feasible way to summarize the information contained in the original samples. Furthermore, ensemble modeling and prediction based on several cross-predictive PLS models can improve prediction performance and increase model stability to a certain extent. Aqueous solubility data are used to demonstrate the ability of the proposed EnPLS method to handle modeling tasks such as descriptor selection, outlier detection, applicability domain assessment, performance improvement, and model stability. Finally, an R package implementing EnPLS has been developed to facilitate its use by chemists and pharmacologists. The R package is freely available at https://github.com/wind22zhu/enpls1.2.
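
The idea described above can be sketched in a few lines of Python. This is an illustration only, not the API of the enpls R package and not necessarily the authors' exact procedure. It fits many PLS models on random training subsets, then summarizes the coefficient distribution (for descriptor ranking) and the held-out prediction error distribution (for outlier screening). The function names, the synthetic data, the 80/20 split, the number of models, and the |mean|/standard deviation importance score are all assumptions made for this sketch.

```python
import numpy as np


def pls1_coefficients(X, y, n_components):
    """Fit single-response PLS by NIPALS; return (coefficients, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)          # weight vector
        t = Xk @ w                      # scores
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        c = yk @ t / tt                 # y loading
        Xk = Xk - np.outer(t, p)        # deflate X
        yk = yk - c * t                 # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, y_mean - x_mean @ coef


def ensemble_pls(X, y, n_components=2, n_models=200, train_fraction=0.8, seed=0):
    """Monte Carlo ensemble of PLS models (hypothetical helper for this sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_train = int(round(train_fraction * n))
    coefs = np.empty((n_models, p))
    preds = np.empty((n_models, n))
    errors = np.full((n_models, n), np.nan)   # only held-out errors are stored
    for m in range(n_models):
        idx = rng.permutation(n)
        train, test = idx[:n_train], idx[n_train:]
        coefs[m], b0 = pls1_coefficients(X[train], y[train], n_components)
        preds[m] = X @ coefs[m] + b0
        errors[m, test] = y[test] - preds[m, test]
    return {
        "coefs": coefs,
        # Assumed importance score: stable, large coefficients rank high.
        "importance": np.abs(coefs.mean(axis=0)) / coefs.std(axis=0),
        "error_mean": np.nanmean(errors, axis=0),
        "error_sd": np.nanstd(errors, axis=0),
        "ensemble_prediction": preds.mean(axis=0),
    }


# Synthetic data: only descriptors 0 and 1 carry signal; sample 5 is corrupted.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=60)
y[5] += 8.0

result = ensemble_pls(X, y)
print("descriptors ranked by importance:", np.argsort(result["importance"])[::-1])
print("sample with largest mean error:", np.argmax(np.abs(result["error_mean"])))
```

In this sketch a sample with a large mean held-out error is a candidate outlier in the response, while a large error standard deviation indicates predictions that vary strongly across sub-models; how these statistics are thresholded for outlier detection or applicability domain assessment is left open here.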
