miRWoods: Enhanced precursor detection and stacked random forests for the sensitive detection of microRNAs

MicroRNAs are conserved, endogenous small RNAs with critical post-transcriptional regulatory functions throughout eukaryotes, including prominent roles in development and disease. Despite much effort, microRNA annotations remain incomplete and still contain errors, largely because of three challenges: identifying valid microRNAs supported by few reads, correctly locating hairpin precursors, and balancing precision and recall. Here, we present miRWoods, which addresses these challenges with a duplex-focused precursor detection method and stacked random forests whose specialized layers detect mature and precursor microRNAs, and which has been tuned to optimize the harmonic mean of precision and recall. We trained and tuned our discovery pipeline on data sets from the well-annotated human genome and evaluated its performance on data from mouse. Compared to existing approaches, miRWoods better identifies precursor spans and balances sensitivity and specificity for greater overall prediction accuracy: it recalls an average of 10% more annotated microRNAs and correctly predicts substantially more microRNAs supported by only one read. We applied this method to the under-annotated genomes of Felis catus (domestic cat) and Bos taurus (cow), and identified hundreds of novel microRNAs in small RNA sequencing data sets from cat muscle and skin, from 10 cow tissues, and from human and mouse cells. Our novel predictions include a microRNA in an intron of tyrosine kinase 2 (TYK2) that is present in both cat and cow, as well as a family of mirtrons with two instances in the human genome. Our predictions also support a more extensive miR-2284 family in the bovine genome, a larger mir-548 family in the human genome, and a larger let-7 family in the feline genome.
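The harmonic mean of precision and recall named above is commonly called the F1 score. As a minimal illustrative sketch (not miRWoods code; the function name, the set-based matching of loci, and the toy identifiers are all assumptions made for this example), it could be computed from a set of predicted loci and a set of annotated loci as follows:

```python
def precision_recall_f1(predicted, annotated):
    """Return (precision, recall, F1) for two collections of locus identifiers.

    Hypothetical helper: a prediction counts as correct only if its
    identifier also appears in the annotation set (exact match).
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)

    # Precision: fraction of predictions that are annotated.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of annotations that were predicted.
    recall = true_positives / len(annotated) if annotated else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0

    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy identifiers for illustration only; these are not results from the paper.
predicted = {"locus_a", "locus_b", "locus_c", "locus_d"}
annotated = {"locus_a", "locus_b", "locus_e"}
p, r, f1 = precision_recall_f1(predicted, annotated)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a predictor cannot score well on it by maximizing recall at the expense of precision, or the reverse.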
