Generating hard-to-obtain information from easy-to-obtain information: Applications in drug discovery and clinical inference

In many important contexts involving measurements of biological entities, there are distinct categories of information: some information is easy-to-obtain information (EI) and can be gathered on virtually every subject of interest, while other information is hard-to-obtain information (HI) and can only be gathered on some of the biological samples. For example, in the context of drug discovery, measurements like the chemical structure of a drug are EI, while measurements of the transcriptome of a cell population perturbed with the drug is HI. In the clinical context, basic health monitoring is EI because it is already being captured as part of other processes, while cellular measurements like flow cytometry or even ultimate patient outcome are HI. We propose building a model to make probabilistic predictions of HI from EI on the samples that have both kinds of measurements, which will allow us to generalize and predict the HI on a large set of samples from just the EI. To accomplish this, we present a conditional Generative Adversarial Network (cGAN) framework we call the Feature Mapping GAN (FMGAN). By using the EI as conditions to map to the HI, we demonstrate that FMGAN can accurately predict the HI, with heterogeneity in cases of distributions of HI from EI. We show that FMGAN is flexible in that it can learn rich and complex mappings from EI to HI, and can take into account manifold structure in the EI space where available. We demonstrate this in a variety of contexts including generating RNA sequencing results on cell lines subjected to drug perturbations using drug chemical structure, and generating clinical outcomes from patient lab measurements. Most notably, we are able to generate synthetic flow cytometry data from clinical variables on a cohort of COVID-19 patients—effectively describing their immune response in great detail, and showcasing the power of generating expensive FACS data from ubiquitously available patient monitoring data. Bigger Picture Many experiments face a trade-off between gathering easy-to-collect information on many samples or hard-to-collect information on a smaller number of small due to costs in terms of both money and time. We demonstrate that a mapping between the easy-to-collect and hard-to-collect information can be trained as a conditional GAN from a subset of samples with both measured. With our conditional GAN model known as Feature-Mapping GAN (FMGAN), the results of expensive experiments can be predicted, saving on the costs of actually performing the experiment. This can have major impact in many settinsg. We study two example settings. First, in the field of pharmaceutical drug discovery early phase pharmaceutical experiments require casting a wide net to find a few potential leads to follow. In the long term, development pipelines can be re-designed to specifically utilize FMGAN in an optimal way to accelerate the process of drug discovery. FMGAN can also have a major impact in clinical setting, where routinely measured variables like blood pressure or heart rate can be used to predict important health outcomes and therefore deciding the best course of treatment.

[1]  Edward W. Lowe,et al.  Computational Methods in Drug Discovery , 2014, Pharmacological Reviews.

[2]  Fabian J Theis,et al.  Diffusion pseudotime robustly reconstructs lineage branching , 2016, Nature Methods.

[3]  Jeffrey M. Hausdorff,et al.  Physionet: Components of a New Research Resource for Complex Physiologic Signals". Circu-lation Vol , 2000 .

[4]  Stéphane Lafon,et al.  Diffusion maps , 2006 .

[5]  Sean C. Bendall,et al.  Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis , 2015, Cell.

[6]  Hans-Peter Kriegel,et al.  Integrating structured biological data by Kernel Maximum Mean Discrepancy , 2006, ISMB.

[7]  Daniel Smilkov,et al.  Similar image search for histopathology: SMILY , 2019, npj Digital Medicine.

[8]  Masahiro Morita,et al.  Distinct perturbation of the translatome by the antidiabetic drug metformin , 2012, Proceedings of the National Academy of Sciences.

[9]  Eric Song,et al.  Longitudinal analyses reveal immunological misfiring in severe COVID-19 , 2020, Nature.

[10]  David van Dijk,et al.  Manifold learning-based methods for analyzing single-cell RNA-sequencing data , 2018 .

[11]  Gabriel Kreiman,et al.  Unsupervised Learning of Visual Structure using Predictive Generative Networks , 2015, ArXiv.

[12]  Chris Sander,et al.  Perturbation biology nominates upstream–downstream drug combinations in RAF inhibitor resistant melanoma cells , 2015, eLife.

[13]  Ronald R. Coifman,et al.  Visualizing structure and transitions in high-dimensional biological data , 2019, Nature Biotechnology.

[14]  Jeff Donahue,et al.  Large Scale GAN Training for High Fidelity Natural Image Synthesis , 2018, ICLR.

[15]  Smita Krishnaswamy,et al.  TraVeLGAN: Image-To-Image Translation by Transformation Vector Learning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Marcus Gastreich,et al.  The next level in chemical space navigation: going far beyond enumerable compound libraries. , 2019, Drug discovery today.

[17]  Michael E. Houle,et al.  Dimensionality, Discriminability, Density and Distance Distributions , 2013, 2013 IEEE 13th International Conference on Data Mining Workshops.

[18]  Yiming Yang,et al.  MMD GAN: Towards Deeper Understanding of Moment Matching Network , 2017, NIPS.

[19]  Leland McInnes,et al.  UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction , 2018, ArXiv.

[20]  Ranadip Pal,et al.  Inference of tumor inhibition pathways from drug perturbation data , 2013, 2013 IEEE Global Conference on Signal and Information Processing.

[21]  W. Knaus,et al.  APACHE II: a severity of disease classification system. , 1985 .

[22]  Angela N. Brooks,et al.  A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles , 2017, Cell.

[23]  Alistair E. W. Johnson,et al.  The eICU Collaborative Research Database, a freely available multi-center database for critical care research , 2018, Scientific Data.

[24]  Patrick Wong,et al.  Longitudinal immunological analyses reveal inflammatory misfiring in severe COVID-19 patients , 2020, medRxiv.

[25]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[26]  Wei Chen,et al.  Unsupervised Neural Machine Translation with Weight Sharing , 2018 .

[27]  S. Stoytchev,et al.  Development and validation of the COVID-19 severity index (CSI): a prognostic tool for early respiratory decompensation , 2020, medRxiv.

[28]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[29]  Jacob Abernethy,et al.  On Convergence and Stability of GANs , 2018 .

[30]  Kevin R. Moon,et al.  Exploring single-cell data with deep multitasking neural networks , 2017, Nature Methods.

[31]  E. Draper,et al.  APACHE II: A severity of disease classification system , 1985, Critical care medicine.

[32]  Jost Tobias Springenberg,et al.  Unsupervised and Semi-supervised Learning with Categorical Generative Adversarial Networks , 2015, ICLR.

[33]  F. Hock Drug Discovery and Evaluation: Pharmacological Assays , 2016, Springer International Publishing.

[34]  Zoubin Ghahramani,et al.  Training generative neural networks via Maximum Mean Discrepancy optimization , 2015, UAI.

[35]  B. Nadler,et al.  Diffusion maps, spectral clustering and reaction coordinates of dynamical systems , 2005, math/0503445.

[36]  Sebastian Nowozin,et al.  Which Training Methods for GANs do actually Converge? , 2018, ICML.

[37]  Kevin R. Moon,et al.  Recovering Gene Interactions from Single-Cell Data Using Data Diffusion , 2018, Cell.

[38]  James Zou,et al.  Feedback GAN (FBGAN) for DNA: a Novel Feedback-Loop Architecture for Optimizing Protein Functions , 2018, ArXiv.