gkm-DNN: efficient prediction using gapped k-mer features and deep neural networks

Extracting informative features from genome sequences is a challenging problem. Gapped k-mer frequency vectors (gkm-fvs) have been introduced as a new type of feature in the last few years. Coupled with a support vector machine (gkm-SVM), gkm-fvs have been used for effective sequence-based prediction (e.g., transcription factor binding site prediction). However, the heavy cost of computing a large kernel matrix prevents gkm-SVM from using large amounts of data. To address this, we propose gkm-DNN, a flexible and scalable framework that performs feature representation and prediction from high-dimensional gkm-fvs using deep neural networks (DNNs). We first implemented an efficient method to calculate the gkm-fv of a given sequence. We then adopted a DNN model that takes gkm-fvs as input to perform a prediction task. Here, we took transcription factor binding site prediction as an illustrative application. We applied gkm-DNN to 467 small and 69 big human ENCODE ChIP-seq datasets to demonstrate its performance and compared it with the state-of-the-art method gkm-SVM. We demonstrated that gkm-DNN not only overcomes the drawbacks of high dimensionality, collinearity and sparsity of gkm-fvs, but also achieves comparable overall performance and distinctly better accuracy than gkm-SVM in much shorter time. Moreover, gkm-DNN can be easily adapted to other applications and can combine different types of data using computational graphs.

Availability: All source code of gkm-DNN is available at http://page.amss.ac.cn/shihua.zhang/.

Contact: zsh@amss.ac.cn.
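To make the gapped k-mer frequency vector concrete, the following is a minimal, naive sketch and not the authors' efficient implementation. It assumes the usual gapped k-mer definition: slide a window of length `l` over the sequence and, for every choice of `k` informative positions inside the window, count the nucleotides found at those positions. The function name `gkm_frequency_vector`, the small defaults `l=4, k=2`, the normalization to frequencies, and the skipping of windows containing ambiguous bases are all assumptions made for illustration. Reverse complements are not merged here.

```python
from itertools import combinations

ALPHABET = "ACGT"
BASE_INDEX = {base: i for i, base in enumerate(ALPHABET)}


def gkm_frequency_vector(seq, l=4, k=2):
    """Naive gapped k-mer frequency vector (hypothetical helper, for illustration).

    The vector has C(l, k) * 4**k entries: one block of 4**k entries per
    choice of k informative positions within a window of length l.
    """
    patterns = list(combinations(range(l), k))  # which positions are informative
    n_words = 4 ** k                            # nucleotide words per pattern
    counts = [0.0] * (len(patterns) * n_words)
    seq = seq.upper()

    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        if any(base not in BASE_INDEX for base in window):
            continue  # assumption: skip windows with ambiguous bases such as N
        for p, positions in enumerate(patterns):
            # Encode the k informative nucleotides as a base-4 integer.
            word = 0
            for pos in positions:
                word = word * 4 + BASE_INDEX[window[pos]]
            counts[p * n_words + word] += 1.0

    total = sum(counts)
    # Assumption: normalize counts to frequencies; return zeros if nothing was counted.
    return [c / total for c in counts] if total > 0 else counts


fv = gkm_frequency_vector("ACGTACGTAC", l=4, k=2)
print(len(fv))  # 96 = C(4, 2) * 4**2
```

Even at these toy settings the vector has 96 entries, most of them zero for a short sequence. The dimension grows as C(l, k) * 4^k, which illustrates the high dimensionality and sparsity that the abstract says the DNN has to cope with.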
