Differential analysis of count data – the DESeq2 package

A basic task in the analysis of count data from RNA-Seq is the detection of differentially expressed genes. The count data are presented as a table that reports, for each sample, the number of sequence fragments assigned to each gene. Analogous data also arise for other assay types, including comparative ChIP-Seq, HiC, shRNA screening, and mass spectrometry. An important analysis question is the quantification and statistical inference of systematic changes between conditions, as compared to within-condition variability. The package DESeq2 provides methods to test for differential expression by use of negative binomial generalized linear models; the estimates of dispersion and logarithmic fold changes incorporate data-driven prior distributions. This vignette explains the use of the package and demonstrates typical workflows. Another vignette, “Beginner’s guide to using the DESeq2 package”, covers similar material but at a slower pace, including the generation of count tables from FASTQ files.
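For orientation, the negative binomial generalized linear model mentioned above can be sketched as follows. The symbols are not defined in this abstract; they are assumed here to follow the notation of the DESeq2 paper (reference [8]): $K_{ij}$ is the count for gene $i$ in sample $j$, $s_j$ a sample-specific size factor, $\alpha_i$ the gene-wise dispersion, $x_{jr}$ the design matrix entries, and $\beta_{ir}$ the log2 fold change coefficients.

```latex
\begin{align}
  K_{ij} &\sim \mathrm{NB}(\mu_{ij}, \alpha_i), \\
  \mu_{ij} &= s_j \, q_{ij}, \\
  \log_2 q_{ij} &= \sum_r x_{jr} \, \beta_{ir}, \\
  \operatorname{Var}(K_{ij}) &= \mu_{ij} + \alpha_i \, \mu_{ij}^2 .
\end{align}
```

Under this parameterization, the data-driven priors act on the dispersions $\alpha_i$ and on the coefficients $\beta_{ir}$, which is where the moderated estimates of dispersion and fold change come from.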

[1]  R. Tibshirani Estimating Transformations for Regression via Additivity and Variance Stabilization , 1988 .

[2]  Davis J. McCarthy,et al.  Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation , 2012, Nucleic acids research.

[3]  Hao Wu,et al.  A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data , 2012, Biostatistics.

[4]  Lior Pachter,et al.  Near-optimal RNA-Seq quantification , 2015, ArXiv.

[5]  Rob Patro,et al.  Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms , 2013, Nature Biotechnology.

[6]  Y. Benjamini,et al.  Controlling the false discovery rate: a practical and powerful approach to multiple testing , 1995 .

[7]  A. Poustka,et al.  Parameter estimation for the calibration and variance stabilization of microarray data , 2003, Statistical applications in genetics and molecular biology.

[8]  W. Huber,et al.  Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 , 2014, Genome Biology.

[9]  W. Huber,et al.  MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets , 2011 .

[10]  R. Gentleman,et al.  Independent filtering increases detection power for high-throughput experiments , 2010, Proceedings of the National Academy of Sciences.

[11]  Judith B. Zaugg Data-driven hypothesis weighting increases detection power in big data analytics , 2015 .

[12]  Wolfgang Huber,et al.  Data-driven hypothesis weighting increases detection power in multiple testing , 2015, bioRxiv.

[13]  D. Cox,et al.  Parameter Orthogonality and Approximate Conditional Inference , 1987 .

[14]  Mick Watson,et al.  Errors in RNA-Seq quantification affect genes of relevance to human disease , 2015, Genome Biology.

[15]  E. Spjøtvoll,et al.  Plots of P-values to evaluate many tests simultaneously , 1982 .

[16]  Geet Duggal,et al.  Salmon: Accurate, Versatile and Ultrafast Quantification from RNA-seq Data using Lightweight-Alignment , 2015 .

[17]  Li Yang,et al.  Conservation of an RNA regulatory map between Drosophila and mammals. , 2011, Genome research.

[18]  David G Hendrickson,et al.  Differential analysis of gene regulation at transcript resolution with RNA-seq , 2012, Nature Biotechnology.

[19]  Inga-Lena Nilsson,et al.  Evidence of a functional estrogen receptor in parathyroid adenomas. , 2012, The Journal of clinical endocrinology and metabolism.

[20]  Colin N. Dewey,et al.  RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome , 2011, BMC Bioinformatics.

[21]  Paul Theodor Pyl,et al.  HTSeq—a Python framework to work with high-throughput sequencing data , 2014, bioRxiv.

[22]  R. Dennis Cook,et al.  Detection of Influential Observation in Linear Regression , 2000, Technometrics.