Spliceator: multi-species splice site prediction using convolutional neural networks

Background: Ab initio prediction of splice sites is an essential step in eukaryotic genome annotation. Recent predictors have exploited Deep Learning algorithms and reliable gene structures from model organisms. However, Deep Learning methods for non-model organisms are lacking.

Results: We developed Spliceator to predict splice sites in a wide range of species, including model and non-model organisms. Spliceator uses a convolutional neural network and is trained on carefully validated data from over 100 organisms. We show that Spliceator achieves consistently high accuracy (89–92%) compared to existing methods on independent benchmarks from human, fish, fly, worm, plant and protist organisms.

Conclusions: Spliceator is a new Deep Learning method trained on high-quality data, which can be used to predict splice sites in diverse organisms, ranging from human to protists, with consistently high accuracy.
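As a rough illustration of how a convolutional model can read raw sequence, the sketch below one-hot encodes a short DNA window and slides a single filter over it. It is not Spliceator's code or architecture: the function names `one_hot` and `conv1d`, the example window, and the hand-set filter for the canonical donor dinucleotide GT are all assumptions made for illustration. A trained network would learn many such filters from data rather than have them written by hand.

```python
# Illustrative sketch only; names, window and filter values are assumptions,
# not taken from Spliceator.
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values (A, C, G, T).
    Symbols outside ACGT (for example N) become all-zero rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d(encoded, kernel, bias=0.0):
    """Valid 1D cross-correlation of a (length x 4) input with a
    (width x 4) kernel, followed by a ReLU."""
    width = len(kernel)
    out = []
    for start in range(len(encoded) - width + 1):
        total = bias
        for i in range(width):
            for j in range(4):
                total += encoded[start + i][j] * kernel[i][j]
        out.append(max(0.0, total))  # ReLU keeps only positive responses
    return out

# Hypothetical filter of width 2 that matches "G" followed by "T".
gt_kernel = [one_hot("G")[0], one_hot("T")[0]]

window = "CAGGTAAGT"
activations = conv1d(one_hot(window), gt_kernel, bias=-1.0)

# Positions where the filter fires, i.e. where "GT" starts in the window.
hits = [i for i, a in enumerate(activations) if a > 0]
print(hits)
```

With a bias of -1, the filter stays silent on a partial match (one of the two bases) and fires only where both bases match, which is why the ReLU output isolates the GT positions.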
