Genome-wide pre-miRNA discovery from few labeled examples

Motivation: Although many machine learning techniques have been proposed for distinguishing miRNA hairpins from other stem-loop sequences, most current methods use supervised learning, which requires high-quality sets of positive and negative examples. These methods have important practical limitations when applied to a real prediction task. First, only a small number of well-known positive pre-miRNA examples is available. Second, it is very difficult to build a set of negative examples that represents the full spectrum of non-miRNA sequences. Third, in any genome there is a huge class imbalance (1:10 000), which is known to affect supervised classifiers in particular.

Results: To enable efficient and fast genome-wide prediction of novel miRNAs, we present miRNAss, a novel method based on semi-supervised learning. It takes advantage of the information provided by the unlabeled stem-loops, thereby improving the prediction rates even when the number of labeled examples is low and not representative of the classes. An automatic method for searching for negative examples to initialize the algorithm is also proposed, so as to spare the user this difficult task. MiRNAss obtained better prediction rates and shorter execution times than state-of-the-art supervised methods. It was validated with genome-wide data from three model species, with more than one million hairpin sequences each, thereby demonstrating its applicability to a real prediction task.

Availability and implementation: An R package can be downloaded from https://cran.r-project.org/package=miRNAss. In addition, a web demo for testing the algorithm is available at http://fich.unl.edu.ar/sinc/web-demo/mirnass. All the datasets used in this study and the sets of predicted pre-miRNAs are available at http://sourceforge.net/projects/sourcesinc/files/mirnass.

Contact: cyones@sinc.unl.edu.ar.

Supplementary information: Supplementary data are available at Bioinformatics online.
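The idea described under Results, scoring unlabeled stem-loops from a handful of labeled positives plus automatically chosen negatives, can be illustrated with a generic graph-based label propagation sketch. This is not the miRNAss implementation or its API: the abstract does not specify the graph construction, the negative-selection rule, or the scoring function, so the Gaussian-kernel similarity graph, the "farthest from any positive" heuristic in `pick_negatives`, the closed-form propagation in `propagate`, and all parameter values below are assumptions chosen for illustration. The toy data are two synthetic 2-D clusters with a 1:10 imbalance, far milder than the 1:10 000 ratio mentioned above.

```python
import numpy as np


def similarity_graph(X, sigma=1.0):
    """Dense Gaussian-kernel similarity matrix with a zero diagonal (assumed graph)."""
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W


def pick_negatives(X, positive_idx, n_neg):
    """Hypothetical heuristic: take the unlabeled points farthest from any positive."""
    dist = np.linalg.norm(X[:, None, :] - X[positive_idx][None, :, :], axis=2)
    nearest_pos = dist.min(axis=1)
    nearest_pos[positive_idx] = -np.inf  # never pick a labeled positive
    return np.argsort(nearest_pos)[-n_neg:]


def propagate(W, positive_idx, negative_idx, alpha=0.9):
    """Solve (I - alpha * S) f = y, where S is the symmetrically normalized graph."""
    degree = W.sum(axis=1)
    S = W / np.sqrt(np.outer(degree, degree))
    y = np.zeros(W.shape[0])
    y[positive_idx] = 1.0    # few known positives
    y[negative_idx] = -1.0   # automatically selected negatives
    return np.linalg.solve(np.eye(W.shape[0]) - alpha * S, y)


# Toy data: a small "positive-like" cluster and a large background cluster.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(8.0, 0.5, size=(200, 2))])

positives = [0, 1, 2]                               # only three labeled positives
negatives = pick_negatives(X, positives, n_neg=10)  # no hand-built negative set
scores = propagate(similarity_graph(X), positives, negatives)

print((scores > 0).sum(), "points scored as positive-like")
```

The sign of each score gives a predicted class for every point, including the unlabeled ones, and the magnitude can be used to rank candidates. A dense similarity matrix is only practical for small inputs; for sets with more than a million hairpins, a sparse nearest-neighbour graph and an iterative solver would be needed instead.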
