What Do Neighbors Tell About You: The Local Context of Cis-Regulatory Modules Complicates Prediction of Regulatory Variants

Many problems in modern genetics and functional genomics require assessing the functional effects of sequence variants, including changes in gene expression. Machine learning is considered a promising approach to this task, but its practical application remains challenging because the available training data are insufficient in volume and diversity. A valuable source of such data is the saturation mutagenesis massively parallel reporter assay, which quantitatively measures changes in transcriptional activity caused by sequence variants. Here, we explore computational prediction of the effects of individual single-nucleotide variants on gene transcription as measured in massively parallel reporter assays, using data from the recent “Regulation Saturation” challenge of the Critical Assessment of Genome Interpretation (CAGI). We show that the estimated prediction quality strongly depends on how the training and validation data are structured. In particular, training on sequence segments located next to the validation data results in “information leakage” caused by the local context. This leakage makes it possible to reproduce the prediction quality of the best CAGI challenge submissions with a fairly simple machine learning approach, and even to obtain notably better-than-random predictions using irrelevant genomic regions. Validation scenarios that prevent such leakage dramatically reduce the measured prediction quality. Performance on independent regulatory regions entirely excluded from the training set appears to be much lower than needed for practical applications, and even the performance estimate itself will become reliable only when richer data from multiple reporters are available. The source code and data are available at https://bitbucket.org/autosomeru_cagi2018/cagi2018_regsat and https://genomeinterpretation.org/content/expression-variants.
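The dependence on data structure can be illustrated with a minimal sketch. It is not the evaluation code from the repository above. The function names (`interleaved_split`, `blocked_split`, `leaky_fraction`), the toy region length, and the `gap` and `window` parameters are hypothetical choices made only for this example. The sketch contrasts a split in which validation positions are interleaved with training positions against a split that holds out one contiguous block, and reports the fraction of validation positions that have a training position within a small window, used here as a rough proxy for exposure to local-context leakage.

```python
from typing import List, Tuple


def interleaved_split(n_positions: int, n_folds: int, fold: int) -> Tuple[List[int], List[int]]:
    """Send every n_folds-th position to validation, so its neighbors stay in training."""
    val = [p for p in range(n_positions) if p % n_folds == fold]
    train = [p for p in range(n_positions) if p % n_folds != fold]
    return train, val


def blocked_split(n_positions: int, n_folds: int, fold: int, gap: int = 0) -> Tuple[List[int], List[int]]:
    """Hold out one contiguous block and drop `gap` flanking positions from training."""
    block = n_positions // n_folds
    start = fold * block
    end = n_positions if fold == n_folds - 1 else start + block
    val = list(range(start, end))
    train = [p for p in range(n_positions) if p < start - gap or p >= end + gap]
    return train, val


def leaky_fraction(train: List[int], val: List[int], window: int = 1) -> float:
    """Fraction of validation positions with a training position within `window` bases."""
    train_set = set(train)
    near = 0
    for v in val:
        offsets = [d for d in range(-window, window + 1) if d != 0]
        if any((v + d) in train_set for d in offsets):
            near += 1
    return near / len(val)


if __name__ == "__main__":
    # Toy region of 100 positions, 5 folds, fold 2 held out (all values are assumptions).
    for name, (train, val) in {
        "interleaved": interleaved_split(100, 5, 2),
        "blocked, no gap": blocked_split(100, 5, 2),
        "blocked, gap=5": blocked_split(100, 5, 2, gap=5),
    }.items():
        print(name, leaky_fraction(train, val, window=1))
```

In this toy setting every validation position of the interleaved split has an adjacent training position, whereas the blocked split exposes only the two block boundaries, and a flanking gap removes even those. A split that excludes an entire regulatory region from training corresponds to the strictest scenario described above.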
