A Hierarchical Bayesian Model for RNA-Seq Data

In the last few years, RNA-Seq has become a popular choice for high-throughput studies of gene expression, showing the potential to supplant microarrays as the standard for transcriptional profiling. At the gene level, RNA-Seq yields counts rather than continuous measures of expression, which creates a need for new methods that handle count data in high-dimensional problems. We present a hierarchical Bayesian approach to the modeling of RNA-Seq data. The model accounts for differences in the total number of counts across samples (sequencing depth) as well as for overdispersion, with no need to transform the data prior to the analysis. Using an MCMC algorithm, we identify differentially expressed genes and obtain promising results on both simulated and real data, compared with those of edgeR and DESeq (state-of-the-art algorithms for RNA-Seq data analysis).
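
To make the two features named above concrete (sequencing depth and overdispersion), the following is a minimal illustrative sketch, not the model of this paper. It assumes a gamma-Poisson (negative binomial) formulation, which the abstract does not specify; the function name `simulate_counts` and the parameters `rel_expr`, `depths`, and `dispersion` are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_counts(rel_expr, depths, dispersion, rng):
    """Draw gene-by-sample counts from a gamma-Poisson mixture.

    Assumption (not from the paper): the expected count for gene g in
    sample j is rel_expr[g] * depths[j], and a gamma-distributed rate
    adds extra-Poisson variation, so that
        Var = mean + dispersion * mean**2.
    """
    rel_expr = np.asarray(rel_expr, dtype=float)
    depths = np.asarray(depths, dtype=float)
    # Expected counts scale with sequencing depth: genes x samples.
    mean = np.outer(rel_expr, depths)
    # Gamma rates with E[rate] = mean and Var[rate] = dispersion * mean**2.
    rates = rng.gamma(shape=1.0 / dispersion, scale=mean * dispersion)
    # Poisson sampling around the random rates gives overdispersed counts.
    return rng.poisson(rates)

rng = np.random.default_rng(0)
rel_expr = np.array([1e-4, 5e-4, 2e-3])   # hypothetical relative expression
depths = np.array([1e6, 2e6, 4e6])        # hypothetical total reads per sample
counts = simulate_counts(rel_expr, depths, dispersion=0.1, rng=rng)
print(counts)
```

Under these assumptions, raw counts for the same gene grow with the depth of each sample, and replicate counts vary more than a Poisson model would allow. This is why a model for such data needs a depth term and a dispersion term rather than a transformation of the counts.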
