Robust and efficient identification of biomarkers from RNA-Seq data using median control chart

Background: One of the main goals of RNA-seq data analysis is the identification of biomarkers that are differentially expressed (DE) across two or more experimental conditions. RNA-seq uses next-generation sequencing technology and has many advantages over microarrays. Numerous statistical methods have already been developed for identifying biomarkers from RNA-seq data, most of them based on either the Poisson or the negative binomial distribution. However, existing methods struggle to identify biomarkers efficiently from discrete RNA-seq data when the datasets contain outliers or extreme observations. In particular, their performance deteriorates further when outliers occur in data from a small number of samples. In this study, we therefore propose an outlier detection and modification approach for RNA-seq data to overcome these problems of traditional methods. We make the proposed method applicable to RNA-seq data by transforming the read count data into continuous data.

Methods: We use a median control chart to detect and modify outlying observations in a log-transformed RNA-seq dataset. To investigate the performance of the proposed method in the absence and presence of outliers, we apply five popular biomarker selection methods (edgeR, edgeR_robust, DESeq, DESeq2 and limma) to both simulated and real datasets.

Results: The simulation results strongly suggest that the proposed method improves the performance of these methods in the presence of outliers. The proposed method also detected an additional 18 outlying DE genes in a real mouse RNA-seq dataset that were not detected by the traditional methods. The results of KEGG pathway and gene ontology analysis indicate that these genes may be biomarkers, which requires validation in a wet lab.

Conclusions: We recommend applying the proposed method for biomarker identification in other RNA-seq datasets.
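The detection and modification step described under Methods can be illustrated with a minimal Python sketch. The abstract does not state the transformation, the control limits, or the replacement rule, so everything specific here is an assumption made for illustration: a log2(count + 1) transform, limits at the median plus or minus k times the scaled median absolute deviation (MAD) with k = 3, and replacement of flagged values by the median. The function name `median_chart_clean` is hypothetical and is not taken from the paper.

```python
import math
from statistics import median


def median_chart_clean(counts, k=3.0):
    """Flag and replace outliers in one gene's read counts within one condition.

    Assumptions (not from the paper): log2(count + 1) transform,
    control limits at median +/- k * scaled MAD, replacement by the median.
    """
    # Map discrete read counts to a continuous scale.
    logged = [math.log2(c + 1) for c in counts]

    # Center line of the chart: the median of the log-transformed values.
    center = median(logged)

    # Scaled MAD; the factor 1.4826 makes it comparable to a standard
    # deviation under normality. If the MAD is 0, every value that differs
    # from the median is flagged.
    mad = 1.4826 * median(abs(x - center) for x in logged)
    lower, upper = center - k * mad, center + k * mad

    # Values outside the control limits are treated as outliers.
    flags = [not (lower <= x <= upper) for x in logged]

    # Modify flagged values by replacing them with the center line.
    cleaned = [center if flagged else x for x, flagged in zip(logged, flags)]
    return cleaned, flags


if __name__ == "__main__":
    cleaned, flags = median_chart_clean([100, 110, 95, 105, 5000])
    print(flags)
```

With the example counts, only the last observation (5000) falls outside the limits and is replaced by the median of the log-transformed values. The cleaned values stay on the log scale, which suits a method such as limma; count-based methods such as edgeR or DESeq2 would need the values mapped back to counts, a step the abstract does not describe.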
