Sequence-based prediction of enhancer regions from DNA random walk

Regulatory elements play a critical role in the development of eukaryotic organisms by controlling the spatio-temporal pattern of gene expression. Enhancers are one class of these elements; they contribute to the regulation of gene expression through chromatin looping or eRNA expression. Experimental identification of novel enhancers is costly, which has driven interest in computational approaches that predict enhancer regions in a genome. Existing computational approaches have primarily been based on training with high-throughput data such as transcription factor binding sites (TFBS), DNA methylation, and histone modification marks. Purely sequence-based approaches, on the other hand, are promising because they are not biased by the complexity or context specificity of such datasets. In sequence-based approaches, machine learning models are trained either directly on sequences or on features derived from them, to classify sequences as enhancers or non-enhancers. In this paper, we use a random walk model to derive statistical and nonlinear dynamic features, along with k-mer features, from experimentally validated sequences taken from the VISTA Enhancer Browser, and apply several machine learning methods to predict whether an input test sequence is an enhancer. Experimental results show that the proposed ensemble-based model achieves an area under the curve (AUC) of 0.86, 0.89, and 0.87 in B cells, T cells, and natural killer cells, respectively, on the histone marks dataset.
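The feature-derivation step described above can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: the step rule (purines +1, pyrimidines -1) is one common DNA walk convention and may differ from the one used in the paper, and the function names `dna_random_walk`, `walk_features`, and `kmer_frequencies` are hypothetical. Only simple summary statistics and k-mer frequencies are shown; the nonlinear dynamic features and the classifiers are left out.

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev

# Assumed step rule (not taken from the paper): purines step up, pyrimidines step down.
STEP = {"A": 1, "G": 1, "C": -1, "T": -1}


def dna_random_walk(seq):
    """Return the cumulative walk positions for a DNA sequence."""
    position, walk = 0, []
    for base in seq.upper():
        position += STEP.get(base, 0)  # unknown bases such as N contribute no step
        walk.append(position)
    return walk


def walk_features(walk):
    """Summarise a non-empty walk with a few simple statistical features."""
    return {
        "mean": mean(walk),
        "std": pstdev(walk),  # population standard deviation
        "min": min(walk),
        "max": max(walk),
        "end": walk[-1],      # net displacement of the walk
    }


def kmer_frequencies(seq, k):
    """Return relative frequencies of all 4**k k-mers over the alphabet ACGT."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in vocab)  # k-mers containing other symbols are ignored
    return {m: (counts[m] / total if total else 0.0) for m in vocab}


if __name__ == "__main__":
    sequence = "AGCTTAGGCA"  # toy input, not a real enhancer
    walk = dna_random_walk(sequence)
    features = {**walk_features(walk), **kmer_frequencies(sequence, 2)}
    print(walk)
    print(len(features))
```

In a setup like this, the walk statistics and the k-mer frequencies would be concatenated into one feature vector per sequence and passed to a classifier; which features and which classifier work best is an empirical question that this sketch does not address.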
