The identification of cis-regulatory elements: A review from a machine learning perspective

The majority of the human genome consists of non-coding regions that were once dismissed as junk DNA. However, recent studies have revealed that these regions contain cis-regulatory elements such as promoters, enhancers, silencers, and insulators. These regulatory elements can play crucial roles in controlling gene expression in specific cell types, conditions, and developmental stages. Disruption of these regions could contribute to phenotypic changes. Precisely identifying regulatory elements is therefore key to deciphering the mechanisms underlying transcriptional regulation. Cis-regulatory events are complex processes that involve chromatin accessibility, transcription factor binding, DNA methylation, histone modifications, and the interactions among them. The development of next-generation sequencing techniques has allowed us to capture these genomic features in depth, and the application of genome sequence analysis in clinical genetics has increased the urgency of detecting these regions. However, the complexity of cis-regulatory events and the deluge of sequencing data call for accurate and efficient computational approaches, in particular machine learning techniques. In this review, we describe machine learning approaches for predicting transcription factor binding sites, enhancers, and promoters, primarily driven by next-generation sequencing data. We also list data sources to facilitate the testing of novel methods. The purpose of this review is to attract computational experts and data scientists to advance this field.
