DeepSS: Exploring Splice Site Motif Through Convolutional Neural Network Directly From DNA Sequence

Splice site prediction and interpretation are crucial to understanding the complicated mechanisms underlying gene transcriptional regulation. Although existing computational approaches can classify true and false splice sites, their performance relies mostly on a set of hand-constructed sequence- or structure-based features, and their interpretability is relatively weak. In view of these challenges, we report a deep learning-based framework (DeepSS), which consists of a DeepSS-C module to classify splice sites and a DeepSS-M module to detect splice site sequence patterns. Unlike previous approaches, which separate feature construction from model training, the DeepSS-C module learns its features as part of model training. Compared with state-of-the-art algorithms, experimental results show that the DeepSS-C module yields more accurate predictions on six publicly available donor/acceptor splice site data sets. In addition, the parameters of the trained DeepSS-M module are used for model interpretation and downstream analysis, including: 1) detection of genomic factors (the truly relevant motifs that drive the related biological process) via filters, from a deep learning perspective; 2) analysis of the ability of CNN filters to detect motifs; 3) co-analysis of filters and motifs on DNA sequence patterns. DeepSS is freely available at http://ailab.ahu.edu.cn:8087/DeepSS/index.html.
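To make the idea of a convolutional filter acting as a motif detector concrete, the following is a minimal sketch, not the DeepSS implementation. It assumes a one-hot encoding over A, C, G, T and a hand-written two-position filter that rewards the canonical donor dinucleotide GT; in a trained network the filter weights would be learned rather than set by hand. The names `one_hot`, `scan`, `best_window`, and `GT_FILTER` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: the function names and the toy filter are
# assumptions made for this example, not part of DeepSS.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T).

    Characters outside A/C/G/T (e.g. N) become all-zero rows.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def scan(seq, filt):
    """Slide a (width x 4) filter along the sequence.

    Returns one score per window: the sum of element-wise products between
    the filter and the one-hot window, as a 1-D convolutional layer computes
    before bias and activation.
    """
    x = one_hot(seq)
    width = len(filt)
    scores = []
    for start in range(len(x) - width + 1):
        score = sum(
            filt[i][j] * x[start + i][j]
            for i in range(width)
            for j in range(4)
        )
        scores.append(score)
    return scores


# Hand-written toy filter (an assumption): rewards G followed by T.
GT_FILTER = [
    [0.0, 0.0, 1.0, 0.0],  # position 0 prefers G
    [0.0, 0.0, 0.0, 1.0],  # position 1 prefers T
]


def best_window(seq, filt):
    """Return (start index, subsequence, score) of the highest-scoring window."""
    scores = scan(seq, filt)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, seq[best:best + len(filt)], scores[best]


if __name__ == "__main__":
    print(best_window("CAGGTAAG", GT_FILTER))
```

Collecting the highest-scoring windows of a learned filter across many sequences and tabulating base frequencies at each position is one common way to turn a filter into a motif that can be compared against known patterns.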
