Multi-branch Convolutional Neural Network for Identification of Small Non-coding RNA Genomic Loci

Genomic regions that encode small RNA genes exhibit characteristic patterns in their sequence, secondary structure, and evolutionary conservation. Convolutional Neural Networks are a family of algorithms that can classify data based on learned patterns. Here we present MuStARD, an application of Convolutional Neural Networks that learns patterns associated with user-defined sets of genomic regions and scans large genomic areas for novel regions exhibiting similar characteristics. We demonstrate that MuStARD is a generic method that can be trained on different classes of human small RNA genomic loci without the need for domain-specific knowledge, owing to the automated feature and background selection processes built into the model. We also demonstrate that MuStARD can identify functional elements across species by predicting mouse small RNAs (pre-miRNAs and snoRNAs) using models trained on the human genome. MuStARD can be used to filter small RNA-Seq datasets to identify novel small RNA loci, both within and across species, as demonstrated in three use cases of human, mouse, and fly pre-miRNA prediction. MuStARD is easy to deploy and extend to a variety of genomic classification questions. Code and trained models are freely available at gitlab.com/RBP_Bioinformatics/mustard.
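
The scanning step described above can be illustrated with a minimal sketch. It is not MuStARD's implementation: the function names, the window size, the step, and the threshold are all assumptions chosen for illustration, and `gc_fraction` is a hypothetical stand-in for a trained classifier that returns a score between 0 and 1 for each window.

```python
from typing import Callable, Iterator, List, Tuple


def sliding_windows(seq: str, window: int, step: int) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) for fixed-size windows along seq."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


def gc_fraction(window_seq: str) -> float:
    """Hypothetical scorer: GC fraction, standing in for a trained model."""
    return sum(base in "GC" for base in window_seq.upper()) / len(window_seq)


def scan(
    seq: str,
    score_fn: Callable[[str], float],
    window: int = 100,      # illustrative value, not taken from the paper
    step: int = 5,          # illustrative value, not taken from the paper
    threshold: float = 0.5,  # illustrative value, not taken from the paper
) -> List[Tuple[int, int]]:
    """Score every window and merge overlapping or adjacent hits.

    Returns half-open (start, end) intervals in sequence coordinates.
    """
    hits: List[Tuple[int, int]] = []
    for start, sub in sliding_windows(seq, window, step):
        if score_fn(sub) >= threshold:
            end = start + len(sub)
            if hits and start <= hits[-1][1]:
                # Extend the previous hit instead of opening a new one.
                hits[-1] = (hits[-1][0], end)
            else:
                hits.append((start, end))
    return hits


if __name__ == "__main__":
    toy = "A" * 20 + "G" * 10 + "A" * 20
    print(scan(toy, gc_fraction, window=10, step=5, threshold=0.9))
```

In a real setting, `score_fn` would wrap a trained model's prediction call, and each window would carry the additional inputs the abstract mentions (secondary structure and conservation) alongside the raw sequence.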
