A comparison of methods for differential expression analysis of RNA-seq data

BackgroundFinding genes that are differentially expressed between conditions is an integral part of understanding the molecular basis of phenotypic variation. In the past decades, DNA microarrays have been used extensively to quantify the abundance of mRNA corresponding to different genes, and more recently high-throughput sequencing of cDNA (RNA-seq) has emerged as a powerful competitor. As the cost of sequencing decreases, it is conceivable that the use of RNA-seq for differential expression analysis will increase rapidly. To exploit the possibilities and address the challenges posed by this relatively new type of data, a number of software packages have been developed especially for differential expression analysis of RNA-seq data.ResultsWe conducted an extensive comparison of eleven methods for differential expression analysis of RNA-seq data. All methods are freely available within the R framework and take as input a matrix of counts, i.e. the number of reads mapping to each genomic feature of interest in each of a number of samples. We evaluate the methods based on both simulated data and real RNA-seq data.ConclusionsVery small sample sizes, which are still common in RNA-seq experiments, impose problems for all evaluated methods and any results obtained under such conditions should be interpreted with caution. For larger sample sizes, the methods combining a variance-stabilizing transformation with the ‘limma’ method for differential expression analysis perform well under many different conditions, as does the nonparametric SAMseq method.

[1]  Susan R. Wilson,et al.  Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing , 2012, BMC Genomics.

[2]  Robert Tibshirani,et al.  Finding consistent patterns: A nonparametric approach for identifying differential expression in RNA-Seq data , 2013, Statistical methods in medical research.

[3]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. , 2010, Nature biotechnology.

[4]  R Core Team,et al.  R: A language and environment for statistical computing. , 2014 .

[5]  Matthew D. Young,et al.  From RNA-seq reads to differential expression results , 2010, Genome Biology.

[6]  Sandrine Dudoit,et al.  Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments , 2010, BMC Bioinformatics.

[7]  Tieliu Shi,et al.  Overview of available methods for diverse RNA-Seq data analyses , 2011, Science China Life Sciences.

[8]  M. Stephens,et al.  RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. , 2008, Genome research.

[9]  S. Luo,et al.  mRNA-seq with agnostic splice site discovery for nervous system transcriptomics tested in chronic pain. , 2010, Genome research.

[10]  Davis J. McCarthy,et al.  Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation , 2012, Nucleic acids research.

[11]  R. Doerge,et al.  Differential expression--the next generation and beyond. , 2012, Briefings in functional genomics.

[12]  Gordon K Smyth,et al.  Statistical Applications in Genetics and Molecular Biology Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments , 2011 .

[13]  Nicolas Servant,et al.  A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis , 2013, Briefings Bioinform..

[14]  Crispin J. Miller,et al.  A comparison of massively parallel nucleotide sequencing with oligonucleotide microarrays for global transcription profiling , 2010, BMC Genomics.

[15]  Fred A. Wright,et al.  A powerful and flexible approach to the analysis of RNA sequence count data , 2011, Bioinform..

[16]  Dan Nettleton,et al.  Estimation of False Discovery Rate Using Permutation P-Values with Different Discrete Null Distributions , 2009 .

[17]  Joseph K. Pickrell,et al.  Understanding mechanisms underlying human gene expression variation with RNA sequencing , 2010, Nature.

[18]  Ning Leng,et al.  EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments , 2013, Bioinform..

[19]  B. Williams,et al.  Mapping and quantifying mammalian transcriptomes by RNA-Seq , 2008, Nature Methods.

[20]  Vanessa M Kvam,et al.  A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. , 2012, American journal of botany.

[21]  A. Conesa,et al.  Differential expression in RNA-seq: a matter of depth. , 2011, Genome research.

[22]  W. Huber,et al.  which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets , 2011 .

[23]  Ning Leng,et al.  EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments , 2013, Bioinform..

[24]  M. Stephens,et al.  Sex-specific and lineage-specific alternative splicing in primates. , 2010, Genome research.

[25]  H. Rue,et al.  Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations , 2009 .

[26]  Alyssa C. Frazee,et al.  ReCount: A multi-experiment resource of analysis-ready RNA-seq gene count datasets , 2011, BMC Bioinformatics.

[27]  Jeff H. Chang,et al.  The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq , 2011 .

[28]  M. Gerstein,et al.  Comparison and calibration of transcriptome data from RNA-Seq and tiling arrays , 2010, BMC Genomics.

[29]  M. Robinson,et al.  Small-sample estimation of negative binomial dispersion, with applications to SAGE data. , 2007, Biostatistics.

[30]  D. Cocchi,et al.  Multiple testing on standardized mortality ratios: a Bayesian hierarchical model for FDR estimation. , 2011, Biostatistics.

[31]  A. W. van der Vaart,et al.  Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. , 2013, Biostatistics.

[32]  M. Robinson,et al.  A scaling normalization method for differential expression analysis of RNA-seq data , 2010, Genome Biology.

[33]  Mark D. Robinson,et al.  edgeR: a Bioconductor package for differential expression analysis of digital gene expression data , 2009, Bioinform..

[34]  Mark D. Robinson,et al.  Moderated statistical tests for assessing differences in tag abundance , 2007, Bioinform..

[35]  R. Spielman,et al.  Polymorphic Cis- and Trans-Regulation of Human Gene Expression , 2010, PLoS biology.

[36]  Daniel Bottomly,et al.  Evaluating Gene Expression in C57BL/6J and DBA/2J Mouse Striatum Using RNA-Seq and Microarrays , 2011, PloS one.

[37]  Thomas J. Hardcastle,et al.  baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data , 2010, BMC Bioinformatics.

[38]  L. AuerPaul,et al.  A Two-Stage Poisson Model for Testing RNA-Seq Data , 2011 .

[39]  I. Nookaew,et al.  A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae , 2012, Nucleic acids research.