Delineating the impact of machine learning elements in pre-microRNA detection

Gene regulation modulates RNA expression via transcription factors. Post-transcriptional gene regulation, in turn, influences the amount of protein product through, for example, microRNAs (miRNAs). Establishing miRNAs and their effects experimentally is complicated, and it becomes futile when the aim is to establish the entirety of miRNA-target interactions. Therefore, computational approaches have been proposed. Many such tools rely on machine learning (ML), which involves example selection, feature extraction, model training, algorithm selection, and parameter optimization. Different ML algorithms have been used for model training on various example sets, more than 1,000 features describing pre-miRNAs have been proposed, and different training and testing schemes have been used for model establishment. For pre-miRNA detection, negative examples cannot easily be established, which poses a problem for two-class classification algorithms. There is also no consensus on which ML approach works best; we therefore set out to establish the impact of the different parts involved in ML on model performance. Furthermore, we established two new negative datasets and analyzed the impact of using them for training and testing. Our aim was to attach an order of importance to the parts involved in ML for pre-miRNA detection. Instead, we found that all parts are intricately connected and that their contributions cannot easily be untangled, which leads us to suggest that many scenarios need to be explored when attempting ML-based pre-miRNA detection.
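To make the ML parts named above concrete, the following is a minimal Python sketch of how example selection, feature extraction, model training, and a repeated random train/test scheme fit together. It is not the pipeline used in this study. The features (GC content and dinucleotide frequencies), the nearest-centroid classifier, the synthetic toy sequences, and all function names are assumptions chosen for illustration only.

```python
import random
from itertools import product

# All 16 RNA dinucleotides, used as simple sequence-based features (an assumption).
DINUCS = ["".join(p) for p in product("ACGU", repeat=2)]


def extract_features(seq):
    """Map an RNA sequence to [GC content] + 16 dinucleotide frequencies."""
    seq = seq.upper().replace("T", "U")
    gc = sum(seq.count(b) for b in "GC") / len(seq)
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = max(len(pairs), 1)
    return [gc] + [pairs.count(d) / total for d in DINUCS]


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(pos, neg):
    """Nearest-centroid 'model': one centroid per class (illustrative stand-in)."""
    return centroid(pos), centroid(neg)


def predict(model, x):
    """Return 1 (positive) if x is at least as close to the positive centroid."""
    pos_c, neg_c = model
    return 1 if sq_dist(x, pos_c) <= sq_dist(x, neg_c) else 0


def monte_carlo_cv(pos, neg, rounds=20, train_frac=0.7, seed=0):
    """Average accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(rounds):
        p, n = pos[:], neg[:]
        rng.shuffle(p)
        rng.shuffle(n)
        kp, kn = int(len(p) * train_frac), int(len(n) * train_frac)
        model = train(p[:kp], n[:kn])
        test = [(x, 1) for x in p[kp:]] + [(x, 0) for x in n[kn:]]
        correct = sum(predict(model, x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


def toy_sequences(rng, n, gc_prob, length=80):
    """Synthetic toy data only: random sequences with a chosen GC bias."""
    return [
        "".join(rng.choice("GC") if rng.random() < gc_prob else rng.choice("AU")
                for _ in range(length))
        for _ in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    # Toy stand-ins for positive and negative example sets (not real pre-miRNAs).
    positives = [extract_features(s) for s in toy_sequences(rng, 30, 0.7)]
    negatives = [extract_features(s) for s in toy_sequences(rng, 30, 0.3)]
    print("mean accuracy:", monte_carlo_cv(positives, negatives))
```

Each function corresponds to one exchangeable part: replacing `toy_sequences` changes the example sets, `extract_features` the feature set, `train` and `predict` the algorithm, and `monte_carlo_cv` the training and testing scheme. Because every part can be swapped independently, the number of possible scenarios grows quickly.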
