Active learning of enhancer and silencer regulatory grammar in photoreceptors

Cis-regulatory elements (CREs) direct gene expression in health and disease, and models that can accurately predict their activities from DNA sequences are crucial for biomedicine. Deep learning is one emerging strategy for modeling the regulatory grammar that relates CRE sequence to function. However, these models require training data on a scale that exceeds the number of CREs in the genome. We address this problem using active machine learning to iteratively train models on multiple rounds of synthetic DNA sequences assayed in live mammalian retinas. During each round of training, the model actively selects sequence perturbations to assay, thereby efficiently generating informative training data. We iteratively trained a model that predicts the activities of sequences containing binding motifs for the photoreceptor transcription factor Cone-rod homeobox (CRX) using an order of magnitude less training data than current approaches. The model’s internal confidence estimates of its predictions are reliable guides for designing sequences with high activity. The model correctly identified critical sequence differences between active and inactive sequences with nearly identical transcription factor binding sites, and revealed order and spacing preferences for combinations of motifs. Our results establish active learning as an effective method to train accurate deep learning models of cis-regulatory function after exhausting naturally occurring training examples in the genome.
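
The select-assay-retrain cycle described above can be illustrated with a toy sketch. The abstract does not state how the model chooses perturbations, so everything here is an assumption made for illustration: the selection rule is committee disagreement (one common active learning strategy), the committee members are simple nearest-neighbour predictors rather than a deep network, and `simulated_assay`, `TOY_MOTIF`, and all parameter values are hypothetical stand-ins for a real reporter assay.

```python
import random
import statistics

ALPHABET = "ACGT"
TOY_MOTIF = "TAATCC"  # assumption: arbitrary motif that drives the simulated assay


def simulated_assay(seq):
    """Hypothetical stand-in for a wet-lab measurement: count toy-motif hits."""
    k = len(TOY_MOTIF)
    return float(sum(seq[i:i + k] == TOY_MOTIF for i in range(len(seq) - k + 1)))


def mutate(seq, rng, n_mut=2):
    """Redraw the base at n_mut random positions (a redraw may repeat the old base)."""
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n_mut):
        bases[pos] = rng.choice(ALPHABET)
    return "".join(bases)


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def committee_predictions(seq, labeled, rng, n_members=10):
    """Each member is a 1-nearest-neighbour regressor fit on a bootstrap resample."""
    preds = []
    for _ in range(n_members):
        resample = [rng.choice(labeled) for _ in labeled]
        _, activity = min(resample, key=lambda item: hamming(seq, item[0]))
        preds.append(activity)
    return preds


def active_learning(seed_seqs, n_rounds=3, n_candidates=200, batch_size=20, seed=0):
    """Grow a labeled set by assaying the candidates the committee disagrees on most."""
    rng = random.Random(seed)
    labeled = [(s, simulated_assay(s)) for s in seed_seqs]
    for _ in range(n_rounds):
        seen = {s for s, _ in labeled}
        # Propose perturbations of sequences that have already been assayed.
        proposed = {mutate(rng.choice(labeled)[0], rng) for _ in range(n_candidates)}
        candidates = sorted(proposed - seen)
        # Uncertainty = variance of the committee's predictions for each candidate.
        uncertainty = {
            s: statistics.pvariance(committee_predictions(s, labeled, rng))
            for s in candidates
        }
        batch = sorted(candidates, key=uncertainty.get, reverse=True)[:batch_size]
        # "Assay" only the selected batch and add the results to the training set.
        labeled.extend((s, simulated_assay(s)) for s in batch)
    return labeled


if __name__ == "__main__":
    rng = random.Random(1)
    seeds = ["".join(rng.choice(ALPHABET) for _ in range(30)) for _ in range(5)]
    data = active_learning(seeds)
    print(len(data), "sequences assayed in total")
```

The point of the sketch is the budget: each round measures only `batch_size` sequences, chosen where the current model is least certain, rather than every candidate perturbation. In practice the committee would be replaced by whatever uncertainty estimate the trained model provides.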
