Prioritizing transcriptomic and epigenomic experiments using an optimization strategy that leverages imputed data

Successful science often involves not only performing experiments well, but also choosing well among many possible experiments. In a hypothesis-generation setting, choosing well means selecting an experiment whose results are interesting or novel. In this work, we formalize this selection procedure in the context of genomic and epigenomic data generation. Specifically, we consider the task faced by a scientific consortium such as the National Institutes of Health ENCODE Consortium, whose goal is to characterize all of the functional elements in the human genome. Given a list of possible cell types or tissue types (“biosamples”) and a list of possible high-throughput sequencing assays, we ask, “Which experiments should ENCODE perform next?” We demonstrate how to represent this task as an optimization problem, where the goal is to maximize the information gained with each successive experiment. Compared with previous work that has addressed a similar problem, our approach has the advantage that it can use imputed data to tailor the selected list of experiments to the data already collected by the consortium. We demonstrate the utility of our proposed method in simulations, and we provide a general software framework, named Kiwano, for selecting genomic and epigenomic experiments.
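
To make the optimization framing concrete, the following is a minimal sketch of one way such a selection could be carried out. It is not the Kiwano implementation or its API. It assumes that every candidate experiment (a biosample and assay pair) has a measured or imputed signal track, that similarity between experiments is the squared Pearson correlation of those tracks, and that the objective is a facility-location function maximized greedily. The function names `similarity_from_tracks` and `greedy_select` are hypothetical and introduced only for illustration.

```python
import numpy as np


def similarity_from_tracks(tracks):
    """Pairwise similarity between experiments (assumption: squared Pearson r).

    tracks: array of shape (n_experiments, n_positions); each row is a
    measured or imputed signal track for one biosample/assay pair.
    Squaring keeps every entry non-negative, which the objective below needs.
    """
    return np.corrcoef(tracks) ** 2


def greedy_select(similarity, performed, budget):
    """Greedily choose up to `budget` new experiments.

    The objective is a facility-location function: the sum, over all
    experiments, of the similarity to the closest selected experiment.
    Experiments in `performed` count as already selected, so the picks are
    tailored to the data that has already been collected.
    """
    n = similarity.shape[0]
    # best[j] = how well experiment j is represented by the current selection
    best = np.zeros(n)
    for i in performed:
        best = np.maximum(best, similarity[i])

    candidates = set(range(n)) - set(performed)
    chosen = []
    for _ in range(budget):
        if not candidates:
            break
        # Marginal gain of each candidate given everything selected so far
        gains = {
            i: np.maximum(best, similarity[i]).sum() - best.sum()
            for i in candidates
        }
        pick = max(gains, key=gains.get)
        chosen.append(pick)
        best = np.maximum(best, similarity[pick])
        candidates.remove(pick)
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for imputed tracks: 6 experiments, 200 genomic positions
    tracks = rng.normal(size=(6, 200))
    sim = similarity_from_tracks(tracks)
    # Experiments 0 and 1 are treated as already performed
    print(greedy_select(sim, performed=[0, 1], budget=3))
```

In this sketch, an experiment that closely resembles one already performed contributes little marginal gain, so the greedy procedure tends to propose experiments that cover parts of the biosample-by-assay space not yet represented.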
