Modelling compound cytotoxicity using conformal prediction and PubChem HTS data.

The assessment of compound cytotoxicity is an important part of the drug discovery process. Accurate predictions of cytotoxicity have the potential to expedite decision-making and save considerable time and effort. In this work we apply class-conditional conformal prediction to model the cytotoxicity of compounds based on 16 high-throughput cytotoxicity assays from PubChem. The data span 16 cell lines and comprise more than 440 000 unique compounds. The data sets are heavily imbalanced, with only 0.8% of the tested compounds being cytotoxic. We trained one classification model for each cell line and evaluated each model with respect to validity and accuracy. The generated models deliver high-quality predictions for both toxic and non-toxic compounds despite the imbalance between the two classes. On external data collected from the same assay provider as one of the investigated cell lines, the model had a sensitivity of 74% and a specificity of 65% at the 80% confidence level among the compounds assigned to a single class. Compared to previous approaches for large-scale cytotoxicity modelling, this represents a balanced performance in the prediction of the toxic and non-toxic classes. The conformal prediction framework also allows the modeller to control the error frequency of the predictions, so that cytotoxicity outcomes can be predicted at a chosen confidence level.
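To illustrate how class-conditional (Mondrian) conformal prediction handles the two classes separately, the following is a minimal sketch and not the implementation used in this work. It assumes an already trained classifier that outputs class probabilities, and it assumes the nonconformity score is one minus the predicted probability of the candidate class; the function names `mondrian_p_values` and `prediction_sets` and all numbers are hypothetical.

```python
import numpy as np

def mondrian_p_values(cal_probs, cal_labels, test_probs):
    """Class-conditional conformal p-values (illustrative sketch).

    cal_probs  : (n_cal, n_classes) predicted probabilities on a calibration set
    cal_labels : (n_cal,) true class indices of the calibration set
    test_probs : (n_test, n_classes) predicted probabilities for new compounds
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels)
    test_probs = np.asarray(test_probs, dtype=float)
    n_test, n_classes = test_probs.shape
    p = np.zeros((n_test, n_classes))
    for c in range(n_classes):
        # Assumed nonconformity score: 1 - predicted probability of class c.
        # Only calibration examples whose true class is c are used, which is
        # what makes the procedure class-conditional.
        alpha_cal = 1.0 - cal_probs[cal_labels == c, c]
        alpha_test = 1.0 - test_probs[:, c]
        # Count calibration scores at least as nonconforming as each test score.
        n_ge = (alpha_cal[None, :] >= alpha_test[:, None]).sum(axis=1)
        p[:, c] = (n_ge + 1) / (alpha_cal.size + 1)
    return p

def prediction_sets(p_values, confidence=0.8):
    """A class is included when its p-value exceeds the significance level."""
    return p_values > (1.0 - confidence)

# Toy data: class 0 = non-toxic, class 1 = toxic (values are made up).
cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4],  # true class 0
             [0.4, 0.6], [0.6, 0.4]]                          # true class 1
cal_labels = [0, 0, 0, 0, 1, 1]
test_probs = [[0.75, 0.25]]

p = mondrian_p_values(cal_probs, cal_labels, test_probs)
sets_80 = prediction_sets(p, confidence=0.8)
```

Because each class is calibrated only against its own calibration examples, the p-value for the rare toxic class is not dominated by the abundant non-toxic class. A compound counts as assigned to a single class when exactly one entry in its row of `sets_80` is true; at a given confidence level a row may also contain both classes or neither.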
