HuntMi: an efficient and taxon-specific approach in pre-miRNA identification

Background

Machine learning techniques are known to be a powerful way of distinguishing microRNA hairpins from pseudo hairpins and have been applied in a number of recognised miRNA search tools. However, many current machine learning methods have drawbacks, including a failure to address the class imbalance problem properly. This may lead to overlearning of the majority class and/or an incorrect assessment of classification performance. Moreover, these tools are effective only for a narrow range of species, usually model organisms. This study aims to improve the performance of the miRNA classification procedure, extend its usability and reduce computational time.

Results

We present HuntMi, a stand-alone machine learning miRNA classification tool. We developed a novel method of dealing with the class imbalance problem called ROC-select, which is based on thresholding the score function produced by traditional classifiers. We also introduced new features to the data representation. Several classification algorithms were tested in combination with ROC-select, and random forest was selected for the best balance between sensitivity and specificity. Reliable assessment of classification performance is guaranteed by using large, strongly imbalanced, and taxon-specific datasets in a 10-fold cross-validation procedure. As a result, HuntMi achieves considerably better performance than any other miRNA classification tool and can be applied in miRNA search experiments in a wide range of species.

Conclusions

Our results indicate that HuntMi is an effective and flexible tool for the identification of new microRNAs in animals, plants and viruses. The ROC-select strategy proves to be superior to other methods of dealing with the class imbalance problem and can possibly be used in other machine learning classification tasks. The HuntMi software and the datasets used in the research are freely available at http://lemur.amu.edu.pl/share/HuntMi/.
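
The idea of thresholding a classifier's score function can be illustrated with a minimal sketch. The abstract does not state which criterion ROC-select uses to pick the threshold, so the sketch below assumes one common choice: the threshold that maximises the geometric mean of sensitivity and specificity on the ROC curve. The function names `roc_points` and `select_threshold` and the toy scores are hypothetical and are not taken from the HuntMi software.

```python
from math import sqrt


def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for every distinct score.

    A sample is predicted positive when its score is >= threshold.
    Labels are 1 (real pre-miRNA) or 0 (pseudo hairpin).
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        points.append((t, tp / n_pos, tn / n_neg))
    return points


def select_threshold(scores, labels):
    """Assumed criterion: maximise sqrt(sensitivity * specificity)."""
    best = max(roc_points(scores, labels), key=lambda p: sqrt(p[1] * p[2]))
    return best[0]


# Toy, imbalanced data: 2 positives among 7 samples (illustrative only).
scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.6, 0.8]
labels = [0, 0, 0, 0, 1, 0, 1]

threshold = select_threshold(scores, labels)
print(threshold)
```

On this toy data a fixed cut-off of 0.5 would miss the positive scored 0.35, whereas the selected threshold keeps both positives at the cost of one false positive. In practice the threshold would be chosen on training folds only and then applied to held-out data.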
