Bioinformatics, Original Paper

Combining Multi-species Genomic Data for MicroRNA Identification Using a Naïve Bayes Classifier

MOTIVATION: Most computational methods for microRNA (miRNA) gene prediction rely on sequence conservation and/or structural similarity. In this study we describe a new technique for predicting miRNA genes that is applicable across several species. The technique is based on machine learning, using the Naïve Bayes classifier. It automatically generates a model from training data consisting of sequence and structure information of known miRNAs from a variety of species.

RESULTS: Our study shows that applying machine learning techniques, together with integrating data from multiple species, is a useful and general approach to miRNA gene prediction. Based on our experiments, we believe that this technique is applicable to a wide range of eukaryotic genomes. Specific structure and sequence features are first used to identify miRNAs, followed by a comparative analysis to decrease the number of false positives (FPs). The resulting algorithm exhibits higher specificity and similar sensitivity compared with currently used algorithms that rely on conserved genomic regions to decrease the rate of FPs.
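To make the classification step concrete, the following is a minimal sketch of a Naïve Bayes classifier over categorical sequence and structure features. It is not the authors' implementation: the two features (binned GC content and binned fraction of paired bases), the thresholds, the names `extract_features` and `NaiveBayes`, and the four training examples are all assumptions invented for illustration. The toy sequences are not real miRNAs, and the comparative filtering step described in the abstract is not shown.

```python
import math
from collections import Counter, defaultdict


def extract_features(seq, structure):
    """Map a candidate hairpin to coarse categorical features.

    `structure` is a dot-bracket string of the same length as `seq`.
    The two features and their thresholds are hypothetical choices.
    """
    gc = sum(base in "GC" for base in seq) / len(seq)
    paired = sum(ch in "()" for ch in structure) / len(structure)
    return (
        "gc_high" if gc >= 0.5 else "gc_low",
        "paired_high" if paired >= 0.6 else "paired_low",
    )


class NaiveBayes:
    """Categorical Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        # (label, feature index) -> counts of each observed feature value
        self.value_counts = defaultdict(Counter)
        # distinct values seen per feature, needed for the smoothing denominator
        self.values = [set() for _ in rows[0]]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(label, i)][value] += 1
                self.values[i].add(value)
        return self

    def log_posteriors(self, row):
        """Return unnormalized log posterior scores, one per class."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for i, value in enumerate(row):
                num = self.value_counts[(label, i)][value] + self.alpha
                den = count + self.alpha * len(self.values[i])
                score += math.log(num / den)  # log likelihood of feature i
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_posteriors(row)
        return max(scores, key=scores.get)


# Invented toy examples (sequence, dot-bracket structure, label).
TOY_DATA = [
    ("GCGCGGAUCCGCGC", "(((((....)))))", "miRNA"),
    ("GGCCGCAAUGCGGCC", "((((((...))))))", "miRNA"),
    ("AUAUAUUAAUAUUA", "..((......))..", "other"),
    ("AAUUAUAUAAUAAU", "..............", "other"),
]

rows = [extract_features(seq, struct) for seq, struct, _ in TOY_DATA]
labels = [label for _, _, label in TOY_DATA]
model = NaiveBayes().fit(rows, labels)

candidate = extract_features("GGGCCCAAAGGGCCC", "((((((...))))))")
print(model.predict(candidate))
```

In this sketch each feature contributes an independent log-likelihood term, which is the conditional independence assumption that gives the classifier its name. Pooling known miRNAs from several species, as the abstract describes, would correspond to simply adding more rows to the training set.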
