MLSeq: Machine learning interface for RNA-sequencing data

BACKGROUND AND OBJECTIVE In the last decade, RNA-sequencing technology has become the method of choice and is preferred to microarray technology for gene-expression-based classification and differential expression analysis, since it produces less noisy data. Although many algorithms have been proposed for microarray data, the number of algorithms and programs available for classification of RNA-sequencing data is limited. For this reason, we developed MLSeq to bring together both frequently used classification algorithms and novel approaches, and to make them available for classification of RNA-sequencing data. The package is developed in the R language environment and distributed through the Bioconductor network.

METHODS Classification of RNA-sequencing data is not straightforward, since raw data should be preprocessed before downstream analysis. With the MLSeq package, researchers can easily preprocess (normalization, filtering, transformation, etc.) and classify raw RNA-sequencing data using two strategies: (i) applying algorithms that were proposed directly for the RNA-sequencing data structure, or (ii) transforming RNA-sequencing data to bring it distributionally closer to the microarray data structure and then applying algorithms developed for microarray data. Moreover, we proposed novel algorithms through MLSeq, such as voom-based (voom is an acronym for variance modelling at the observational level) nearest shrunken centroids (voomNSC) and diagonal linear discriminant analysis (voomDLDA).

MATERIALS Three real RNA-sequencing datasets (cervical cancer, lung cancer and aging) were used to evaluate model performance. Poisson linear discriminant analysis (PLDA) and negative binomial linear discriminant analysis (NBLDA) were selected as algorithms based on discrete distributions, and voomNSC, nearest shrunken centroids (NSC) and support vector machines (SVM) were selected as algorithms based on continuous distributions for model comparisons. The algorithms were compared using classification accuracy and sparsity on an independent test set.

RESULTS The algorithms based on discrete distributions performed better on the cervical cancer and aging data, with accuracies above 0.92. On the lung cancer data, most algorithms performed similarly, with accuracies of 0.88, except that SVM achieved an accuracy of 0.94. Our voomNSC algorithm was the sparsest, selecting 2.2% and 6.6% of all features for the cervical cancer and lung cancer datasets, respectively. However, on the aging data, sparse classifiers were not able to select an optimal subset of features.

CONCLUSION MLSeq is a comprehensive and easy-to-use interface for classification of gene expression data. It allows researchers to perform both preprocessing and classification tasks on a single platform. With this property, MLSeq can be considered a pipeline for the classification of RNA-sequencing data.
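
As a rough illustration of strategy (ii) in METHODS, the following Python sketch applies a voom-style log2 counts-per-million transformation to a count matrix and then classifies samples with a plain nearest-centroid rule, reporting accuracy on a held-out set. It is not MLSeq code (MLSeq is an R package), and it leaves out the precision weights and centroid shrinkage that a voom-based nearest shrunken centroids classifier would involve. The function names (`log_cpm`, `fit_centroids`, `predict`, `accuracy`) and the simulated Poisson counts are assumptions made for this example only.

```python
import numpy as np


def log_cpm(counts):
    """voom-style log2 counts-per-million for a samples x genes count matrix.

    Uses log2((count + 0.5) / (library_size + 1) * 1e6); the offsets avoid
    taking the log of zero.
    """
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)  # total reads per sample
    return np.log2((counts + 0.5) / (lib_size + 1.0) * 1e6)


def fit_centroids(x, y):
    """Return the class labels and the per-class mean profile (centroid)."""
    classes = np.unique(y)
    centroids = np.vstack([x[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(x, classes, centroids):
    """Assign each sample to the class with the closest centroid."""
    # Squared Euclidean distance of every sample to every centroid.
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# Simulated data (hypothetical): two classes, 50 genes, 5 of which are
# up-regulated four-fold in the second class.
rng = np.random.default_rng(0)
n_per_class, n_genes = 20, 50
base = rng.uniform(20, 200, size=n_genes)
fold = np.ones(n_genes)
fold[:5] = 4.0
counts = np.vstack([
    rng.poisson(base, size=(n_per_class, n_genes)),
    rng.poisson(base * fold, size=(n_per_class, n_genes)),
])
labels = np.array([0] * n_per_class + [1] * n_per_class)

# Hold out every second sample as an independent test set.
train = np.arange(len(labels)) % 2 == 0
test = ~train

# The transformation is per sample, so it can be applied before splitting.
x = log_cpm(counts)
classes, centroids = fit_centroids(x[train], labels[train])
pred = predict(x[test], classes, centroids)
acc = accuracy(labels[test], pred)
print(f"held-out accuracy: {acc:.2f}")
```

Strategy (i) would instead model the raw counts directly, for example with a Poisson or negative binomial likelihood as in PLDA and NBLDA, and skip the transformation step.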
