GRAM: A GeneRAlized Model to predict the molecular effect of a non-coding variant in a cell-type specific manner

There has been much effort to prioritize genomic variants with respect to their impact on “function”. However, function is often not precisely defined: sometimes it is the disease association of a variant; other times it reflects a molecular effect on transcription or epigenetics. Here we coupled multiple genomic predictors to build GRAM, a generalized model that predicts a well-defined experimental target: the expression-modulating effect of a non-coding variant in a cell-specific manner.

As a first step, we performed feature engineering. Using a LASSO-regularized linear model, we found transcription factor (TF) binding to be the most predictive feature, especially for TFs that are hubs in the regulatory network. In contrast, evolutionary conservation, a popular feature in many other functional-impact predictors, contributes almost nothing. Moreover, TF binding inferred from in vitro SELEX is as effective as that inferred from in vivo ChIP-Seq.

Second, we implemented GRAM by integrating SELEX features and expression profiles. The program combines a universal regulatory score for a variant in a non-coding element with a modifier score reflecting the particular cell type. We benchmarked GRAM on a large-scale MPRA dataset in the GM12878 cell line, achieving a ROC score of ~0.73; performance on the K562 cell line was similar.

Finally, we evaluated the performance of GRAM on targeted regions using luciferase assays in the MCF7 and K562 cell lines. We noted that changing the insertion position of the construct relative to the reporter gene gives very different results, highlighting the importance of carefully defining the functional target that the model is predicting.

Author Summary

Non-coding variants lie outside protein-coding regions, and many have been found to have disease associations. However, knowledge of the molecular effects of these variants in a cell-specific context is very limited. In addition, differences in output among experimental platforms can add complexity to the analysis of their molecular function. We developed GRAM, a generalized model that predicts the molecular effect of non-coding variants in multiple cell types and for different experimental platforms. We first selected the most informative cell-independent SELEX TF binding scores at the variant locus as features, and then combined them with cell-specific gene expression profiles to build a multi-step prediction model. GRAM has been tested on both MPRA and luciferase assays across three cell lines (GM12878, K562, and MCF7) and shows high performance.
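To make the reported benchmark figure concrete, the sketch below shows one common way to compute an area under the ROC curve from per-variant scores and binary labels. It assumes that the "ROC score" above refers to this area (AUROC), which the text does not state explicitly. The function name `roc_auc` and the toy labels and scores are hypothetical illustrations, not GRAM code or GRAM results.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney U) identity.

    labels: iterable of 0/1 (1 = expression-modulating variant, by assumption)
    scores: iterable of model scores, higher = more likely positive
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    # AUROC equals the fraction of (positive, negative) pairs in which the
    # positive example receives the higher score; ties count as half a win.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, for illustration only.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.7]
auc = roc_auc(labels, scores)
print(f"AUROC = {auc:.4f}")  # prints "AUROC = 0.8125"
```

Under this reading, a value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a score of ~0.73 means that a randomly chosen positive variant outranks a randomly chosen negative one about 73% of the time.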
