High throughput sequencing technologies are supplanting microarrays as the preferred technology for detecting and quantifying differential gene expression. The raw data produced by the a technique known as RNA-sequencing (RNA-seq), consists of integer counts of reverse transcribed cDNA fragment reads mapped onto each gene or transcript isoform in a reference genome or transcriptome. Many software packages exist for analysing RNA-seq datasets consisting of tables of mapped read counts from biological or technical replicate experiments under two or more conditions, the purpose being to detect which genes are differentially expressed between conditions. Two state-of-the-art packages, DESeq and edgeR, are based on a negative binomial model of read counts. Our tests with simulated data constructed according to the statistical model assumed by these packages reveal that both packages generate a non-uniform p-value spectrum from null-hypothesis data. We demonstrate how specific knowledge of the non-uniformity can be exploited to develop a graphical technique based on the Storey-Tibshirani method for improving estimates of p-values and false discovery rates in databases where differential expression is present. We have developed an add-on package for DESeq and edgeR, called Polyfit, which implements this method, and evaluate its performance against DESeq, edgeR and another recently introduced package, PoissonSeq, using simulated data.
[1]
Susan R. Wilson,et al.
Error estimates for the analysis of differential expression from RNA-seq count data
,
2014,
PeerJ.
[2]
L. J.,et al.
Normalization , testing , and false discovery rate estimation for RNA-sequencing data
,
2012
.
[3]
Mark D. Robinson,et al.
edgeR: a Bioconductor package for differential expression analysis of digital gene expression data
,
2009,
Bioinform..
[4]
W. Huber,et al.
Differential expression analysis for sequence count data
,
2010
.
[5]
Charlotte Soneson,et al.
A comparison of methods for differential expression analysis of RNA-seq data
,
2013,
BMC Bioinformatics.
[6]
Susan R. Wilson,et al.
Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing
,
2012,
BMC Genomics.
[7]
John D. Storey,et al.
Statistical significance for genomewide studies
,
2003,
Proceedings of the National Academy of Sciences of the United States of America.