iMiRNA-SSF: Improving the Identification of MicroRNA Precursors by Combining Negative Sets with Different Distributions

The identification of microRNA precursors (pre-miRNAs) helps in understanding regulatory mechanisms in biological processes. The performance of computational predictors depends on their training sets, in which the negative sets play an important role. We therefore investigated how benchmark datasets influence the predictive performance of computational predictors for miRNA identification, and found that the negative samples have a significant impact on the predictive results of various methods. We constructed a new benchmark set whose negative samples follow different data distributions. Trained on this high-quality benchmark dataset, a new computational predictor called iMiRNA-SSF is proposed, which employs various features extracted from RNA sequences. Experimental results showed that iMiRNA-SSF outperforms three state-of-the-art computational methods. For practical applications, a web server for iMiRNA-SSF was established at http://bioinformatics.hitsz.edu.cn/iMiRNA-SSF/.
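
The idea of drawing negative samples from sources with different distributions can be illustrated with a minimal sketch. The abstract does not describe the actual construction procedure, so everything below is an assumption made for illustration: the function name `build_benchmark`, the equal-share sampling rule, and the toy sequences are all hypothetical and are not taken from iMiRNA-SSF.

```python
import random

def build_benchmark(positives, negative_sets, seed=0):
    """Pool an equal-sized sample from each negative source into one labeled set.

    Hypothetical illustration only; not the procedure used by iMiRNA-SSF.
    """
    rng = random.Random(seed)
    # Assumed rule: each negative source contributes the same number of
    # samples, so that total negatives roughly match the positives.
    per_source = len(positives) // len(negative_sets)
    negatives = []
    for source in negative_sets:
        k = min(per_source, len(source))
        negatives.extend(rng.sample(source, k))
    # Label real pre-miRNAs as 1 and negatives as 0, then shuffle.
    data = [(seq, 1) for seq in positives] + [(seq, 0) for seq in negatives]
    rng.shuffle(data)
    return data

# Toy sequences standing in for real pre-miRNAs and two negative sources.
positives = ["GGAUCC", "AUGGCA", "CCGAUA", "UUAGGC"]
neg_a = ["AAAAAA", "CCCCCC", "GGGGGG"]
neg_b = ["ACACAC", "UGUGUG", "GAGAGA"]
benchmark = build_benchmark(positives, [neg_a, neg_b])
```

A classifier trained on such a pooled set sees negatives from more than one distribution, which is the property the abstract argues matters for predictive performance.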
