miRFam: an effective automatic miRNA classification method based on n-grams and a multiclass SVM

BackgroundMicroRNAs (miRNAs) are ~22 nt long integral elements responsible for post-transcriptional control of gene expressions. After the identification of thousands of miRNAs, the challenge is now to explore their specific biological functions. To this end, it will be greatly helpful to construct a reasonable organization of these miRNAs according to their homologous relationships. Given an established miRNA family system (e.g. the miRBase family organization), this paper addresses the problem of automatically and accurately classifying newly found miRNAs to their corresponding families by supervised learning techniques. Concretely, we propose an effective method, miRFam, which uses only primary information of pre-miRNAs or mature miRNAs and a multiclass SVM, to automatically classify miRNA genes.ResultsAn existing miRNA family system prepared by miRBase was downloaded online. We first employed n-grams to extract features from known precursor sequences, and then trained a multiclass SVM classifier to classify new miRNAs (i.e. their families are unknown). Comparing with miRBase's sequence alignment and manual modification, our study shows that the application of machine learning techniques to miRNA family classification is a general and more effective approach. When the testing dataset contains more than 300 families (each of which holds no less than 5 members), the classification accuracy is around 98%. Even with the entire miRBase15 (1056 families and more than 650 of them hold less than 5 samples), the accuracy surprisingly reaches 90%.ConclusionsBased on experimental results, we argue that miRFam is suitable for application as an automated method of family classification, and it is an important supplementary tool to the existing alignment-based small non-coding RNA (sncRNA) classification methods, since it only requires primary sequence information.AvailabilityThe source code of miRFam, written in C++, is freely and publicly available at: http://admis.fudan.edu.cn/projects/miRFam.htm.

[1]  Rodrigo Lopez,et al.  Clustal W and Clustal X version 2.0 , 2007, Bioinform..

[2]  Geoffrey J. Barton,et al.  Human miRNA Precursors with Box H/ACA snoRNA Features , 2009, PLoS Comput. Biol..

[3]  C. Burge,et al.  The microRNAs of Caenorhabditis elegans. , 2003, Genes & development.

[4]  Sean R. Eddy,et al.  Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction , 2004, BMC Bioinformatics.

[5]  Eric Westhof,et al.  Single Processing Center Models for Human Dicer and Bacterial RNase III , 2004, Cell.

[6]  Ana Kozomara,et al.  miRBase: integrating microRNA annotation and deep-sequencing data , 2010, Nucleic Acids Res..

[7]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[8]  Elena Rivas,et al.  Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs , 2000, Bioinform..

[9]  Yoram Singer,et al.  Improved Boosting Algorithms Using Confidence-rated Predictions , 1998, COLT' 98.

[10]  A. V. Lobanov,et al.  Genetic Code Supports Targeted Insertion of Two Amino Acids by One Codon , 2009, Science.

[11]  S. Sathiya Keerthi,et al.  Which Is the Best Multiclass SVM Method? An Empirical Study , 2005, Multiple Classifier Systems.

[12]  R. Aharonov,et al.  Identification of hundreds of conserved and nonconserved human microRNAs , 2005, Nature Genetics.

[13]  Eugene W. Myers,et al.  Basic local alignment search tool. Journal of Molecular Biology , 1990 .

[14]  Sam Griffiths-Jones,et al.  Annotating noncoding RNA genes. , 2007, Annual review of genomics and human genetics.

[15]  Jan-Peter Nap,et al.  In silico miRNA prediction in metazoan genomes: balancing between sensitivity and specificity , 2009, BMC Genomics.

[16]  Ryan D. Morin,et al.  Application of massively parallel sequencing to microRNA profiling and discovery in human embryonic stem cells. , 2008, Genome research.

[17]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[18]  Zhang-xun Wang,et al.  Identification and Characterization of Novel MicroRNAs from Schistosoma japonicum , 2008, PloS one.

[19]  Geoffrey J. Barton,et al.  Jalview Version 2—a multiple sequence alignment editor and analysis workbench , 2009, Bioinform..

[20]  D. Bartel MicroRNAs Genomics, Biogenesis, Mechanism, and Function , 2004, Cell.

[21]  N. Rajewsky,et al.  A human snoRNA with microRNA-like functions. , 2008, Molecular cell.

[22]  N. Rajewsky,et al.  Deep conservation of microRNA-target relationships and 3'UTR motifs in vertebrates, flies, and nematodes. , 2006, Cold Spring Harbor symposia on quantitative biology.

[23]  W. Gilbert Origin of life: The RNA world , 1986, Nature.

[24]  Tyler Risom,et al.  Evolutionary conservation of microRNA regulatory circuits: an examination of microRNA gene complexity and conserved microRNA-target interactions through metazoan phylogeny. , 2007, DNA and cell biology.

[25]  Alberto Maria Segre,et al.  Programs for Machine Learning , 1994 .

[26]  Jidong Liu,et al.  Control of protein synthesis and mRNA degradation by microRNAs. , 2008, Current opinion in cell biology.

[27]  Thomas Hofmann,et al.  Support vector machine learning for interdependent and structured output spaces , 2004, ICML.

[28]  Sean R. Eddy,et al.  Infernal 1.0: inference of RNA alignments , 2009, Bioinform..

[29]  Sean R. Eddy,et al.  Rfam: an RNA family database , 2003, Nucleic Acids Res..

[30]  M. Levine,et al.  miRTRAP, a computational method for the systematic identification of miRNAs from high throughput sequencing data , 2010, Genome Biology.

[31]  Kristin Reiche,et al.  Structural profiles of human miRNA families from pairwise clustering , 2009, Bioinform..

[32]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[33]  Rolf Backofen,et al.  Inferring Noncoding RNA Families and Classes by Means of Genome-Scale Structure-Based Clustering , 2007, PLoS Comput. Biol..

[34]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[35]  N. Rajewsky,et al.  Discovering microRNAs from deep sequencing data using miRDeep , 2008, Nature Biotechnology.

[36]  G. Church,et al.  Computational and experimental identification of C. elegans microRNAs. , 2003, Molecular cell.

[37]  BMC Bioinformatics , 2005 .

[38]  D. Bartel,et al.  MicroRNAS and their regulatory roles in plants. , 2006, Annual review of plant biology.

[39]  V. Ambros,et al.  The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14 , 1993, Cell.

[40]  A. Bradley,et al.  Identification of mammalian microRNA host genes and transcription units. , 2004, Genome research.

[41]  Yasunori Murakami,et al.  Mapping the face in the somatosensory brainstem , 2010, Nature Reviews Neuroscience.

[42]  Z. Weng,et al.  Sorting of Drosophila small silencing RNAs partitions microRNA* strands into the RNA interference pathway. , 2010, RNA.

[43]  Alessandra Carbone,et al.  MIReNA: finding microRNAs with high accuracy and no learning at genome scale and from deep sequencing data , 2010, Bioinform..

[44]  Santosh K. Mishra,et al.  De novo SVM classification of precursor microRNAs from genomic pseudo hairpins using global and intrinsic folding measures , 2007, Bioinform..

[45]  Ana M. Aransay,et al.  miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments , 2009, Nucleic Acids Res..

[46]  Stijn van Dongen,et al.  miRBase: tools for microRNA genomics , 2007, Nucleic Acids Res..

[47]  Yoram Singer,et al.  Improved Boosting Algorithms Using Confidence-rated Predictions , 1998, COLT' 98.

[48]  Ping Wu,et al.  PmiRKB: a plant microRNA knowledge base , 2010, Nucleic Acids Res..

[49]  Robert D. Finn,et al.  Rfam: updates to the RNA families database , 2008, Nucleic Acids Res..

[50]  Koby Crammer,et al.  On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..

[51]  Sam Griffiths-Jones,et al.  The microRNA Registry , 2004, Nucleic Acids Res..

[52]  Sean R. Eddy,et al.  Infernal 1.0: inference of RNA alignments , 2009, Bioinform..

[53]  H. Brunner Annual Review of Genomics and Human Genetics , 2001, European Journal of Human Genetics.

[54]  Derek Y. Chiang,et al.  MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery , 2010, Nucleic acids research.

[55]  Stefano Piccolo,et al.  MicroRNA control of signal transduction , 2010, Nature Reviews Molecular Cell Biology.