Undersampling: case studies of flaviviral inhibitory activities

Imbalanced datasets, in which inactive compounds greatly outnumber active ones, are a common challenge in ligand-based model building workflows for drug discovery. This is particularly true for neglected tropical diseases, since efforts to identify therapeutics for these diseases are often limited. In this report, we analyze the performance of several undersampling strategies in modeling Dengue Virus 2 (DENV2) inhibitory activity, as well as anti-flaviviral activity against the West Nile (WNV) and Zika (ZIKV) viruses. To this end, we build datasets comprising 1218 (159 actives and 1059 inactives), 1044 (132 actives and 912 inactives) and 302 (75 actives and 227 inactives) molecules with known DENV2, WNV and ZIKV inhibitory activity profiles, respectively. We develop ensemble classifiers for these endpoints and compare the performance of the different undersampling algorithms on external sets. Data pruning algorithms yield superior performance relative to data selection algorithms. The best overall performance is provided by the one-sided selection algorithm, with test set balanced accuracy (BACC) values of 0.84, 0.74 and 0.77 for the DENV2, WNV and ZIKV inhibitory activities, respectively. For model building, we use the recently proposed GT-STAF information indices and compare the predictivity of three molecular fragmentation approaches: connected subgraphs, substructures and ALogP atom types. These approaches show comparable performance when used individually, whereas a combination of indices based on all three fragmentation strategies enhances the predictivity of the built ensembles. The built models could be useful for screening new molecules with possible DENV, WNV and ZIKV inhibitory activities. ADMET modelers are encouraged to adopt undersampling algorithms in their workflows when dealing with imbalanced datasets.
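To illustrate why balanced accuracy, rather than plain accuracy, is the metric reported for these endpoints, the following minimal Python sketch computes the inactive-to-active ratio for each dataset from the class counts given above, and then scores a trivial classifier that labels every molecule inactive. The function name `balanced_accuracy`, the 1/0 label encoding and the all-inactive classifier are assumptions introduced for illustration only; they are not taken from the study's workflow.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on actives, label 1) and
    specificity (recall on inactives, label 0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Class counts as reported in the abstract: (actives, inactives).
datasets = {"DENV2": (159, 1059), "WNV": (132, 912), "ZIKV": (75, 227)}

# Inactive-to-active ratio for each endpoint.
imbalance_ratio = {
    name: inactives / actives for name, (actives, inactives) in datasets.items()
}

# Hypothetical baseline (illustration only): label every DENV2 molecule inactive.
n_actives, n_inactives = datasets["DENV2"]
y_true = [1] * n_actives + [0] * n_inactives
y_pred = [0] * (n_actives + n_inactives)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bacc = balanced_accuracy(y_true, y_pred)

for name, ratio in imbalance_ratio.items():
    print(f"{name}: {ratio:.2f} inactives per active")
print(f"All-inactive baseline on DENV2: accuracy={accuracy:.3f}, BACC={bacc:.3f}")
```

Under these assumptions, the all-inactive baseline reaches a plain accuracy of about 0.87 on the DENV2 class distribution while its balanced accuracy is only 0.5, which is why BACC is the more informative measure on datasets of this kind.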
