SpliceFinder: ab initio prediction of splice sites using convolutional neural network

Identifying splice sites is a necessary step in analyzing the location and structure of genes. Two dinucleotides, GT and AG, are highly frequent at splice sites, and many other patterns also occur at splice sites and have important biological functions. However, these dinucleotides also occur frequently in sequences without splice sites, which makes prediction prone to false positives. Most existing tools select all sequences containing the two dimers and then focus on distinguishing true splice sites from pseudo ones. This approach reduces false positives, but it misses non-canonical splice sites. We have designed SpliceFinder, a splice site predictor based on a convolutional neural network (CNN). To achieve ab initio prediction, we trained the network on human genomic data. An iterative approach is adopted to reconstruct the dataset, which tackles the data imbalance problem and forces the model to learn more features of splice sites. The proposed CNN achieves a classification accuracy of 90.25%, which is 10% higher than existing algorithms. It also outperforms existing methods in terms of area under the receiver operating characteristic curve (AUC), recall, precision, and F1 score. Furthermore, SpliceFinder can find the exact positions of splice sites on long genomic sequences using a sliding window. Compared with other state-of-the-art splice site prediction tools, SpliceFinder produces about half as many false positives while keeping recall above 0.8. SpliceFinder also captures non-canonical splice sites. In addition, it performs well on the genomic sequences of Drosophila melanogaster, Mus musculus, Rattus, and Danio rerio without retraining. In summary, we have proposed a new CNN-based ab initio splice site prediction tool, SpliceFinder, which generates fewer false positives, can detect non-canonical splice sites, and is transferable to other species without retraining. The source code and additional materials are available at https://gitlab.deepomics.org/wangruohan/SpliceFinder.
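
The sliding-window scan mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the SpliceFinder source code. The one-hot encoding, the label convention (0 = no site, 1 = donor, 2 = acceptor), the window length, and the function names `one_hot`, `scan`, and `toy_classify` are all assumptions made for illustration. In particular, `toy_classify` is a hypothetical stand-in for the trained CNN: it only checks for the canonical GT and AG dimers next to the window centre, which is exactly the kind of rule that cannot recover non-canonical sites.

```python
from typing import Callable, List, Tuple

import numpy as np

BASES = "ACGT"  # assumed column order of the one-hot encoding


def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (len, 4) matrix; unknown bases (e.g. N) stay all-zero."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            out[i, j] = 1.0
    return out


def scan(
    seq: str,
    classify: Callable[[np.ndarray], int],
    window: int,
    step: int = 1,
) -> List[Tuple[int, int]]:
    """Slide a fixed-length window along seq and collect (centre position, label)
    for every window whose predicted label is not 0 (assumed to mean "no site")."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        label = classify(one_hot(seq[start:start + window]))
        if label != 0:
            hits.append((start + window // 2, label))
    return hits


def toy_classify(x: np.ndarray) -> int:
    """Hypothetical stand-in for a trained model: canonical dimers only.
    Returns 1 if GT starts at the window centre, 2 if AG ends just before it."""
    c = x.shape[0] // 2

    def is_base(row: int, base: str) -> bool:
        return x[row, BASES.index(base)] == 1.0

    if is_base(c, "G") and is_base(c + 1, "T"):
        return 1
    if is_base(c - 2, "A") and is_base(c - 1, "G"):
        return 2
    return 0


donor_hits = scan("CCCCCGTCCCCC", toy_classify, window=6)
acceptor_hits = scan("CCCCAGCCCCCC", toy_classify, window=6)
print(donor_hits, acceptor_hits)
```

In a real setting, `classify` would wrap a trained network's prediction on each encoded window. The scan loop stays the same, since the classifier is passed in as a plain callable.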
