Two New Parameters Based on Distances in a Receiver Operating Characteristic Chart for the Selection of Classification Models

Several indices describe different aspects of the performance of QSAR classification models, and the area under the Receiver Operating Characteristic (ROC) curve remains the most powerful overall measure of that performance. All ROC-related parameters can be calculated for both the training and test sets; nevertheless, none of them is by itself an absolute indicator of classification performance. Moreover, one of the biggest drawbacks is the computing time needed to obtain the area under the ROC curve, which slows down any calculation algorithm. The present study proposes two new parameters, based on distances in a ROC graph, for selecting classification models with an appropriate balance between the training and test sets: the ROC graph Euclidean distance (ROCED) and the ROC graph Euclidean distance corrected with the fitness function FIT(λ) (ROCFIT). The behavior of these indices was examined in a study of the mutagenicity of a number of nonaromatic halogenated derivatives across four genotoxicity end points. ROCED achieved a better balance between sensitivity and specificity for both the training and prediction sets than other indices such as the Matthews correlation coefficient, Wilks' lambda, or the area under the ROC curve. However, the linear discriminant models selected with ROCED showed lower statistical significance. The second parameter, ROCFIT, retains the capabilities of ROCED while improving the significance of the models through the inclusion of FIT(λ).
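
The abstract does not give the formulas for ROCED or ROCFIT, so the sketch below illustrates only the underlying building block, under a stated assumption: that the "distance in a ROC graph" is the Euclidean distance from a classifier's point (false positive rate, sensitivity) to the ideal point (0, 1). The function names and the confusion-matrix counts are hypothetical and chosen purely for illustration; how the training and test distances are combined into ROCED, and how FIT(λ) enters ROCFIT, is left to the full paper.

```python
import math


def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


def roc_distance(sensitivity, specificity):
    """Euclidean distance in ROC space to the ideal point (FPR=0, TPR=1).

    Assumption: this is the distance the abstract refers to; the abstract
    itself does not state the reference point.
    """
    false_positive_rate = 1.0 - specificity
    return math.hypot(false_positive_rate, 1.0 - sensitivity)


# Hypothetical counts for a training set and a test set (not from the paper).
d_train = roc_distance(*rates(tp=40, fn=10, tn=45, fp=5))
d_test = roc_distance(*rates(tp=8, fn=2, tn=7, fp=3))

print(f"training distance: {d_train:.4f}")
print(f"test distance:     {d_test:.4f}")
```

A distance of 0 corresponds to a perfect classifier, and similar values for the training and test sets suggest balanced behavior on both. Unlike the area under the ROC curve, this quantity needs only a single confusion matrix per set, which is consistent with the computing-time concern raised above.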
