The model adaptive space shrinkage (MASS) approach: a new method for simultaneous variable selection and outlier detection based on model population analysis.

Variable selection and outlier detection are important steps in chemical modeling. The two steps usually affect each other, and the order in which they are performed also strongly affects the modeling results. Currently, many studies perform them separately and in different orders. In this study, we examined the interaction between outliers and variables and compared modeling procedures that perform variable selection and outlier detection in different orders. Because the order can affect the interpretation of the model, it is difficult to decide which order is preferable when the prediction errors obtained with the different orders are close. To address this problem, an approach that performs variable selection and outlier detection simultaneously, called Model Adaptive Space Shrinkage (MASS), was developed. The proposed approach is based on model population analysis (MPA). Through weighted binary matrix sampling (WBMS) from the model space, a large number of partial least squares (PLS) regression models were built, and the elite portion of these models was selected to statistically reassign the weight of each variable and each sample. The whole process was then repeated until the weights of the variables and samples converged. Finally, MASS adaptively found a high-performance model consisting of an optimized variable subset and an optimized sample subset. The combination of these two subsets can be regarded as the cleaned dataset used for chemical modeling. Because both tasks are handled in a single procedure, the problem of ordering variable selection and outlier detection is avoided. One near-infrared (NIR) spectroscopy dataset and one quantitative structure-activity relationship (QSAR) dataset were used to test the approach. The results demonstrated that MASS is a useful method for data cleaning before building a predictive model.
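
The iterative loop described in the abstract (sample sub-models, keep the elite ones, reassign weights, repeat) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: each variable and sample is drawn into a sub-model by an independent Bernoulli trial with probability equal to its current weight, sub-models are scored by the cross-validated RMSE of a small NIPALS PLS1 fit, and the new weight of a variable or sample is its inclusion frequency among the elite sub-models. The function names (`pls1_fit`, `cv_rmse`, `mass_sketch`) and all parameter defaults are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_comp=2):
    """Fit a single-response PLS model (NIPALS); return (coef, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(min(n_comp, X.shape[1])):
        w = Xc.T @ yc
        norm = np.linalg.norm(w)
        if norm < 1e-12:              # nothing left to explain
            break
        w = w / norm
        t = Xc @ w                    # scores
        tt = t @ t
        p = Xc.T @ t / tt             # X loadings
        c = (yc @ t) / tt             # y loading
        Xc = Xc - np.outer(t, p)      # deflate X
        yc = yc - c * t               # deflate y
        W.append(w); P.append(p); q.append(c)
    if not W:
        return np.zeros(X.shape[1]), y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, y_mean - x_mean @ coef

def cv_rmse(X, y, n_folds=3, n_comp=2):
    """Cross-validated RMSE with interleaved folds (assumed scoring rule)."""
    folds = np.arange(len(y)) % n_folds
    sq_err = 0.0
    for k in range(n_folds):
        test = folds == k
        coef, b0 = pls1_fit(X[~test], y[~test], n_comp)
        sq_err += np.sum((X[test] @ coef + b0 - y[test]) ** 2)
    return np.sqrt(sq_err / len(y))

def mass_sketch(X, y, n_iter=5, n_models=200, elite_frac=0.1,
                min_samples=10, seed=0):
    """Return (variable weights, sample weights) after iterative shrinkage."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    var_w = np.full(p, 0.5)           # assumed starting weights
    smp_w = np.full(n, 0.5)
    n_elite = max(1, int(elite_frac * n_models))
    for _ in range(n_iter):
        # Weighted binary sampling: one row per sub-model, True = included.
        var_masks = rng.random((n_models, p)) < var_w
        smp_masks = rng.random((n_models, n)) < smp_w
        scores = np.full(n_models, np.inf)
        for m in range(n_models):
            cols, rows = var_masks[m], smp_masks[m]
            if cols.sum() == 0 or rows.sum() < min_samples:
                continue              # too small to evaluate; never elite
            scores[m] = cv_rmse(X[rows][:, cols], y[rows])
        elite = np.argsort(scores)[:n_elite]
        # New weight = inclusion frequency among the elite sub-models.
        var_w = var_masks[elite].mean(axis=0)
        smp_w = smp_masks[elite].mean(axis=0)
    return var_w, smp_w
```

Under these assumptions, variables whose weights stay high form the selected variable subset, and samples whose weights shrink toward zero are outlier candidates. A weight that reaches exactly 0 or 1 stays fixed in later iterations, which is one simple way the weights can converge in this sketch.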
