A Systematic Evaluation of Feature Selection and Classification Algorithms Using Simulated and Real miRNA Sequencing Data

Sequencing is widely used to discover associations between microRNAs (miRNAs) and diseases. However, the negative binomial distribution (NB) and high dimensionality of data obtained using sequencing can lead to low-power results and low reproducibility. Several statistical learning algorithms have been proposed to address sequencing data, and although evaluation of these methods is essential, such studies are relatively rare. The performance of seven feature selection (FS) algorithms, including baySeq, DESeq, edgeR, the rank sum test, lasso, particle swarm optimistic decision tree, and random forest (RF), was compared by simulation under different conditions based on the difference of the mean, the dispersion parameter of the NB, and the signal to noise ratio. Real data were used to evaluate the performance of RF, logistic regression, and support vector machine. Based on the simulation and real data, we discuss the behaviour of the FS and classification algorithms. The Apriori algorithm identified frequent item sets (mir-133a, mir-133b, mir-183, mir-937, and mir-96) from among the deregulated miRNAs of six datasets from The Cancer Genomics Atlas. Taking these findings altogether and considering computational memory requirements, we propose a strategy that combines edgeR and DESeq for large sample sizes.

[1]  W. Huber,et al.  which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets , 2011 .

[2]  Q. Zou,et al.  Approaches for Recognizing Disease Genes Based on Network , 2014, BioMed research international.

[3]  Sandrine Dudoit,et al.  Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments , 2010, BMC Bioinformatics.

[4]  Pedro Larrañaga,et al.  A review of feature selection techniques in bioinformatics , 2007, Bioinform..

[5]  Daniela M. Witten,et al.  An Introduction to Statistical Learning: with Applications in R , 2013 .

[6]  B. Liu,et al.  An Approach for Identifying Cytokines Based on a Novel Ensemble Classifier , 2013, BioMed research international.

[7]  M. Robinson,et al.  Small-sample estimation of negative binomial dispersion, with applications to SAGE data. , 2007, Biostatistics.

[8]  Clay B Marsh,et al.  MicroRNA 133B targets pro-survival molecules MCL-1 and BCL2L2 in lung cancer. , 2009, Biochemical and biophysical research communications.

[9]  Thomas Tuschl,et al.  Comprehensive profiling of circulating microRNA via small RNA sequencing of cDNA libraries reveals biomarker potential and limitations , 2013, Proceedings of the National Academy of Sciences.

[10]  Masayuki Kano,et al.  miR‐145, miR‐133a and miR‐133b: Tumor‐suppressive miRNAs target FSCN1 in esophageal squamous cell carcinoma , 2010, International journal of cancer.

[11]  Peter Bühlmann Regression shrinkage and selection via the Lasso: a retrospective (Robert Tibshirani): Comments on the presentation , 2011 .

[12]  C. Burge,et al.  Conserved Seed Pairing, Often Flanked by Adenosines, Indicates that Thousands of Human Genes are MicroRNA Targets , 2005, Cell.

[13]  Thomas J. Hardcastle,et al.  baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data , 2010, BMC Bioinformatics.

[14]  Q. Zou,et al.  Hierarchical Classification of Protein Folds Using a Novel Ensemble Classifier , 2013, PloS one.

[15]  Vanessa M Kvam,et al.  A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. , 2012, American journal of botany.

[16]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[17]  Mark D. Robinson,et al.  edgeR: a Bioconductor package for differential expression analysis of digital gene expression data , 2009, Bioinform..

[18]  Vlad I. Morariu,et al.  Expression , 2015, Principles of Molecular Virology.

[19]  Wei-Chung Cheng,et al.  Gene selection for cancer identification: a decision tree model empowered by particle swarm optimization algorithm , 2014, BMC Bioinformatics.

[20]  Hsien-Da Huang,et al.  miRTarBase update 2014: an information resource for experimentally validated miRNA-target interactions , 2013, Nucleic Acids Res..

[21]  F Remacle,et al.  miRNA and mRNA cancer signatures determined by analysis of expression levels in large cohorts of patients , 2013, Proceedings of the National Academy of Sciences.

[22]  Paul Bertone,et al.  Systematic comparison of microarray profiling, real-time PCR, and next-generation sequencing technologies for measuring differential microRNA expression. , 2010, RNA.

[23]  Ramón Díaz-Uriarte,et al.  Gene selection and classification of microarray data using random forest , 2006, BMC Bioinformatics.

[24]  Lingling Hu,et al.  miRClassify: An advanced web server for miRNA family classification and annotation , 2014, Comput. Biol. Medicine.

[25]  Hui Zhang,et al.  Selected isomiR expression profiles via arm switching? , 2014, Gene.

[26]  E. Izaurralde,et al.  Gene silencing by microRNAs: contributions of translational repression and mRNA decay , 2011, Nature Reviews Genetics.

[27]  R. Tibshirani,et al.  Regression shrinkage and selection via the lasso: a retrospective , 2011 .

[28]  Kazuya Kawahara,et al.  MiR‐96 and miR‐183 detection in urine serve as potential tumor markers of urothelial carcinoma: correlation with stage and grade, and comparison with urinary cytology , 2011, Cancer science.