Identification of microRNA precursors using reduced and hybrid features.

MicroRNAs (miRNAs) are a group of short non-coding RNA molecules. They play a vital role in regulating gene expression at the transcriptional and post-transcriptional levels. Abnormal miRNA expression has been observed in cancer, heart disease and nervous system disorders. Therefore, for both basic research and microRNA-based therapy, it is imperative to distinguish real pre-miRNAs from false ones (hairpin sequences that resemble pre-miRNA stem-loops). Both conservation-based and machine learning methods have been applied to the identification of miRNAs. However, machine learning algorithms have gained more popularity than conservation-based algorithms owing to their sensitivity and overall performance. Given the avalanche of RNA sequences discovered in the post-genomic age, it is necessary to construct a predictor for the identification of pre-microRNAs in humans. We have developed a predictor called MicroR-Pred, in which RNA sequences are represented by a hybrid feature vector. The novelty of the predictor lies in its use of the partial least squares (PLS) technique for dimension reduction, followed by the Random Forest (RF) and Support Vector Machine (SVM) algorithms for classification. The performance of the MicroR-Pred model is promising compared with other state-of-the-art miRNA predictors: it achieved accuracies of 88.40% with RF and 93.90% with SVM.
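
The dimension-reduction step described above can be illustrated with a minimal sketch. It assumes a single-response PLS (PLS1) computed with NIPALS-style deflation; the abstract does not state which PLS variant MicroR-Pred uses, so this is an illustration rather than the authors' implementation. The function name `pls1_scores`, the toy feature matrix `X` and the labels `y` are hypothetical. The resulting low-dimensional scores are what would then be passed to a classifier such as RF or SVM; that step is not shown here.

```python
import math


def pls1_scores(X, y, n_components, tol=1e-12):
    """Project the rows of X onto PLS latent variables (PLS1, NIPALS-style).

    X: list of feature vectors (one per sequence); y: list of 0/1 labels.
    Returns one score vector of length <= n_components per sample.
    """
    n, n_feat = len(X), len(X[0])
    # Center every feature column and the label vector.
    means = [sum(row[j] for row in X) / n for j in range(n_feat)]
    Xk = [[row[j] - means[j] for j in range(n_feat)] for row in X]
    y_mean = sum(y) / n
    yk = [v - y_mean for v in y]

    scores = [[] for _ in range(n)]
    for _ in range(n_components):
        # Weight vector: feature-space direction with maximal covariance with y.
        w = [sum(Xk[i][j] * yk[i] for i in range(n)) for j in range(n_feat)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < tol:
            break  # no covariance left between the residual X and y
        w = [v / norm for v in w]
        # Score vector: one latent-variable value per sample.
        t = [sum(Xk[i][j] * w[j] for j in range(n_feat)) for i in range(n)]
        tt = sum(v * v for v in t)
        # X loadings and the regression coefficient of y on the scores.
        load = [sum(Xk[i][j] * t[i] for i in range(n)) / tt for j in range(n_feat)]
        q = sum(yk[i] * t[i] for i in range(n)) / tt
        # Deflate: remove what this component explains from X and y.
        for i in range(n):
            for j in range(n_feat):
                Xk[i][j] -= t[i] * load[j]
            yk[i] -= q * t[i]
            scores[i].append(t[i])
    return scores


# Toy data (hypothetical): 6 sequences described by 4 composition features,
# labelled 1 for real pre-miRNA and 0 for pseudo hairpin.
X = [
    [0.30, 0.20, 0.25, 0.25],
    [0.35, 0.15, 0.30, 0.20],
    [0.32, 0.18, 0.28, 0.22],
    [0.20, 0.30, 0.20, 0.30],
    [0.22, 0.28, 0.18, 0.32],
    [0.18, 0.33, 0.24, 0.25],
]
y = [1, 1, 1, 0, 0, 0]

reduced = pls1_scores(X, y, n_components=2)
for label, row in zip(y, reduced):
    print(label, [round(v, 4) for v in row])
```

Unlike an unsupervised projection, PLS uses the class labels when choosing its directions, which is why the reduced features tend to keep the information that matters for the subsequent classifier.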
