Automatic outlier sample detection based on regression analysis and repeated ensemble learning

Abstract: The fields of chemoinformatics and chemometrics require regression models with high predictive performance. To construct such models while appropriately detecting outlier samples, a new outlier detection and regression method based on ensemble learning is proposed. Multiple regression models are constructed, and y-values are estimated by ensemble learning. Outlier samples are then detected by considering all of the regression models together. Repeating these calculations allows outlier samples to be detected robustly and independently. Analysis of a numerical simulation dataset, two quantitative structure-activity relationship datasets, and two quantitative structure-property relationship datasets confirms that outlier samples can be detected automatically, that informative compounds can be selected, and that the estimation performance of the regression models improves.
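The general idea of repeated, ensemble-based outlier detection can be illustrated with a minimal sketch. The abstract does not specify the submodel type, how the ensemble is aggregated, or the detection threshold, so everything below is an assumption made for illustration: bootstrap resampling, a one-variable least-squares line as the submodel, the median across submodels as the ensemble estimate, and a cutoff of three scaled median absolute deviations. The names `fit_line` and `detect_outliers` are hypothetical and do not come from the paper.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:  # degenerate resample: fall back to a constant model
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def detect_outliers(xs, ys, n_models=50, max_rounds=10, k=3.0, seed=0):
    """Illustrative repeated ensemble outlier detection (assumed scheme)."""
    rng = random.Random(seed)
    n = len(xs)
    outliers = set()
    for _ in range(max_rounds):
        inliers = [i for i in range(n) if i not in outliers]
        # Ensemble step: fit each submodel on a bootstrap resample of the
        # samples currently regarded as inliers, then predict every sample.
        preds = [[] for _ in range(n)]
        for _ in range(n_models):
            sample = [rng.choice(inliers) for _ in inliers]
            a, b = fit_line([xs[i] for i in sample], [ys[i] for i in sample])
            for i in range(n):
                preds[i].append(a + b * xs[i])
        # Ensemble estimate of y: median over all submodels (an assumption).
        resid = [ys[i] - statistics.median(preds[i]) for i in range(n)]
        # Robust scale of the residuals via the median absolute deviation;
        # 1.4826 makes it comparable to a standard deviation for normal data.
        med = statistics.median(resid)
        scale = 1.4826 * statistics.median(abs(r - med) for r in resid)
        flagged = {i for i in range(n) if abs(resid[i] - med) > k * scale}
        # Repeat until the set of flagged samples stops changing.
        if flagged == outliers:
            break
        outliers = flagged
    return sorted(outliers)


# Toy data: y = 2x + 1 with small deterministic noise and two shifted samples.
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 1.0 + 0.1 * ((i * 7) % 5 - 2) for i, x in enumerate(xs)]
ys[4] += 50.0
ys[25] += 50.0
print(detect_outliers(xs, ys))
```

In this sketch the first round fits the submodels on all samples, so the estimates are still distorted by the shifted points. Later rounds refit on the remaining samples only, which is one plausible reading of why repeated calculation makes detection more robust.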
