MuStARD: Deep Learning for intra- and inter-species scanning of functional genomic patterns

Regions of the genome that produce different classes of functional elements also exhibit distinct patterns in their sequence, secondary structure, and evolutionary conservation. Deep Learning is a family of Machine Learning algorithms that has recently been applied to a variety of pattern recognition problems. Here we present MuStARD (gitlab.com/RBP_Bioinformatics/mustard), a Deep Learning framework that can learn and combine sequence, structure, and conservation patterns in sets of functional regions, and accurately identify additional members of a given set over wide genomic areas. MuStARD is designed for general use and includes an iterative, fully automated background selection capability. We demonstrate that MuStARD can be trained without changes on different classes of human small RNA loci (pre-microRNAs and snoRNAs) and build accurate prediction models for both, outperforming state-of-the-art methods designed specifically for each class. Furthermore, we demonstrate MuStARD's capacity for inter-species identification of functional elements by predicting mouse small RNAs using models trained on human data. MuStARD is easy to deploy and can be extended to a variety of genomic classification questions.
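To make the idea of combining sequence, structure, and conservation over a scanned region more concrete, the following is a minimal illustrative sketch in Python. It is not MuStARD's implementation: the function names (`one_hot`, `sliding_windows`, `encode_window`), the window and step sizes, and the dot-bracket structure alphabet are all assumptions chosen for illustration. The structure string and conservation scores are assumed to be precomputed by external tools.

```python
from typing import Dict, List, Sequence, Tuple

NUCLEOTIDES = "ACGT"        # assumed sequence alphabet
STRUCTURE_SYMBOLS = "(.)"   # assumed dot-bracket alphabet for secondary structure


def one_hot(symbols: str, alphabet: str) -> List[List[int]]:
    """Encode each position as a 0/1 row; unknown symbols (e.g. N) give all zeros."""
    return [[1 if s == a else 0 for a in alphabet] for s in symbols.upper()]


def sliding_windows(length: int, window: int, step: int) -> List[Tuple[int, int]]:
    """Return (start, end) coordinates of fixed-size windows covering a region."""
    return [(s, s + window) for s in range(0, length - window + 1, step)]


def encode_window(
    seq: str,
    structure: str,
    conservation: Sequence[float],
    start: int,
    end: int,
) -> Dict[str, list]:
    """Build one input per feature type for a single window.

    A multi-branch model could consume each entry in a separate branch
    before merging them; that model is not shown here.
    """
    return {
        "sequence": one_hot(seq[start:end], NUCLEOTIDES),
        "structure": one_hot(structure[start:end], STRUCTURE_SYMBOLS),
        "conservation": list(conservation[start:end]),
    }


if __name__ == "__main__":
    # Toy region; all values are made up for illustration only.
    seq = "ACGTNACGTA"
    structure = "((..))...."
    conservation = [0.1] * len(seq)

    for start, end in sliding_windows(len(seq), window=4, step=3):
        features = encode_window(seq, structure, conservation, start, end)
        print(start, end, len(features["sequence"]), len(features["structure"]))
```

Keeping the three feature types as separate inputs, rather than concatenating them into one matrix, is a design choice of this sketch; it lets each type be dropped or added independently when experimenting with different classes of elements.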
