Conformal Prediction Classification of a Large Data Set of Environmental Chemicals from ToxCast and Tox21 Estrogen Receptor Assays.

Quantitative structure-activity relationships (QSAR) are critical to exploitation of the chemical information in toxicology databases. Exploitation can be extraction of chemical knowledge from the data but also making predictions of new chemicals based on quantitative analysis of past findings. In this study, we analyzed the ToxCast and Tox21 estrogen receptor data sets using Conformal Prediction to enhance the full exploitation of the information in these data sets. We applied aggregated conformal prediction (ACP) to the ToxCast and Tox21 estrogen receptor data sets using support vector machine classifiers to compare overall performance of the models but, more importantly, to explore the performance of ACP on data sets that are significantly enriched in one class without employing sampling strategies of the training set. ACP was also used to investigate the problem of applicability domain using both data sets. Comparison of ACP to previous results obtained on the same data sets using traditional QSAR approaches indicated similar overall balanced performance to methods in which careful training set selections were made, e.g., sensitivity and specificity for the external Tox21 data set of 70-75% and far superior results to those obtained using traditional methods without training set sampling where the corresponding results showed a clear imbalance of 50 and 96%, respectively. Application of conformal prediction to imbalanced data sets facilitates an unambiguous analysis of all data, allows accurate predictive models to be built which display similar accuracy in external validation to external validation, and, most importantly, allows an unambiguous treatment of the applicability domain.