
gkm-DNN: efficient prediction using gapped k-mer features and deep neural networks

Extracting informative features from genome sequences is a challenging problem. Gapped k-mer frequency vectors (gkm-fvs) have been introduced as a new type of feature in the last few years. Coupled with a support vector machine (gkm-SVM), gkm-fvs have been used for effective sequence-based prediction (e.g., transcription factor binding site prediction). However, the heavy cost of computing a large kernel matrix prevents gkm-SVM from using large amounts of data. To address this, we propose gkm-DNN, a flexible and scalable framework that performs feature representation and prediction from high-dimensional gkm-fvs using deep neural networks (DNNs). We first implemented an efficient method to calculate the gkm-fv of a given sequence. We then adopted a DNN model that takes gkm-fvs as input to perform a prediction task. Here, we took transcription factor binding site prediction as an illustrative application. We applied gkm-DNN to 467 small and 69 big human ENCODE ChIP-seq datasets to demonstrate its performance and compared it with the state-of-the-art method gkm-SVM. We demonstrated that gkm-DNN not only overcomes the drawbacks of high dimensionality, collinearity and sparsity of gkm-fvs, but also achieves overall performance comparable to gkm-SVM, with distinctly better accuracy, in much less time. Moreover, gkm-DNN can be easily adapted to other applications and can combine different types of data using computational graphs. Availability: All source code of gkm-DNN is available at http://page.amss.ac.cn/shihua.zhang/. Contact: zsh@amss.ac.cn.
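To make the notion of a gapped k-mer frequency vector concrete, the sketch below counts gapped k-mers in a single sequence. It is a naive illustration, not the efficient method the authors implemented: the function names `gapped_kmer_counts` and `feature_dimension`, the parameters `l` (window length) and `k` (number of informative positions), and the use of `-` to mark gap positions are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations
from math import comb


def gapped_kmer_counts(seq, l=4, k=2):
    """Count gapped k-mers in seq (hypothetical helper, naive version).

    Every window of length l contributes one count for each way of
    keeping k informative positions and masking the other l - k as gaps.
    """
    seq = seq.upper()
    patterns = list(combinations(range(l), k))  # all choices of kept positions
    counts = Counter()
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in "ACGT" for base in window):
            continue
        for positions in patterns:
            keep = set(positions)
            gapped = "".join(
                window[i] if i in keep else "-" for i in range(l)
            )
            counts[gapped] += 1
    return counts


def feature_dimension(l, k):
    """Number of distinct gapped k-mers: C(l, k) gap patterns times 4^k words."""
    return comb(l, k) * 4 ** k


if __name__ == "__main__":
    counts = gapped_kmer_counts("ACGT", l=3, k=2)
    for feature, count in sorted(counts.items()):
        print(feature, count)
    print("dimension:", feature_dimension(3, 2))
```

The dimension grows as C(l, k) · 4^k, which is why the resulting vectors are high-dimensional and sparse for realistic values of `l` and `k`; a dense vector for a downstream model would be obtained by fixing an ordering of all possible gapped k-mers and filling in these counts.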

Shihua Zhang | Zhen Cao
