Reliably assessing prediction reliability for high dimensional QSAR data

Predictability and prediction reliability are of utmost importance in characterizing a good quantitative structure–activity relationship (QSAR) model. However, validation methods are insufficient to guarantee the prediction reliability of QSAR models. Moreover, high-dimensional samples pose a great challenge to traditional methods in terms of predictive power. This study therefore presents a predictive classifier, TreeEC, that can assess prediction reliability with high confidence, especially for high-dimensional QSAR data. Two approaches for assessing prediction reliability are provided: applicability domain and prediction confidence. We demonstrate that the applicability domain struggles to guarantee a model's prediction reliability: samples very close to the domain center are often predicted more poorly than those outside the domain. Prediction confidence is instead more promising for assessing prediction reliability. On a large data set assessed by prediction confidence, external samples with confidence greater than 95 % are predicted with an accuracy of 94 %, in contrast to the average accuracy of 84 %. We also show, using 11 public data sets, that TreeEC is less affected by high dimensionality than other popular methods. A free version of TreeEC with a user-friendly interface can be downloaded from http://pharminfo.zju.edu.cn/computation/TreeEC/TreeEC.html.
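The comparison between overall accuracy and accuracy on high-confidence samples can be illustrated with a minimal sketch. The abstract does not give TreeEC's confidence formula, so the sketch assumes a common ensemble-style definition in which confidence is |2p - 1| for a mean positive-class vote p. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple


def prediction_confidence(p_active: float) -> float:
    """Assumed definition: map a mean positive-class vote in [0, 1] to a confidence in [0, 1]."""
    return abs(2.0 * p_active - 1.0)


def accuracy_by_confidence(
    p_active: List[float], labels: List[int], threshold: float
) -> Tuple[float, Optional[float], int]:
    """Return (overall accuracy, accuracy above the confidence threshold, size of that subset)."""
    # Class call: positive (1) when the mean vote reaches 0.5, otherwise negative (0).
    preds = [1 if p >= 0.5 else 0 for p in p_active]
    hits = [int(pred == y) for pred, y in zip(preds, labels)]
    # Keep only the samples whose confidence exceeds the threshold.
    high_hits = [h for h, p in zip(hits, p_active) if prediction_confidence(p) > threshold]
    overall = sum(hits) / len(hits)
    high = sum(high_hits) / len(high_hits) if high_hits else None
    return overall, high, len(high_hits)


# Made-up external set: mean ensemble votes and true classes (illustration only).
votes = [0.99, 0.02, 0.60, 0.45, 0.98, 0.55]
truth = [1, 0, 0, 1, 1, 1]
overall_acc, high_conf_acc, n_high = accuracy_by_confidence(votes, truth, threshold=0.95)
print(overall_acc, high_conf_acc, n_high)
```

In this toy set the three high-confidence samples are all predicted correctly while the overall accuracy is lower, which is the kind of gap the abstract reports between high-confidence and average accuracy.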
