Using Poisson mixed-effects model to quantify transcript-level gene expression in RNA-Seq

MOTIVATION: RNA sequencing (RNA-Seq) is a powerful technology for mapping and quantifying transcriptomes using ultra high-throughput next-generation sequencing. With deep sequencing, the expression levels of all transcripts, including novel ones, can be quantified digitally. Although extremely promising, RNA-Seq poses challenges for data analysis: it generates massive amounts of data, and short read alignment is subject to substantial biases and uncertainty. In particular, large base-specific variation and between-base dependence make simple approaches ineffective, such as those that normalize RNA-Seq data and quantify gene expression by averaging.

RESULTS: In this study, we propose a Poisson mixed-effects (POME) model to characterize base-level read coverage within each transcript. The underlying expression level is included as a key parameter in this model. Because the proposed model incorporates both the base-specific variation and the between-base dependence that affect the read coverage profile throughout the transcript, it can lead to improved quantification of the true underlying expression level.

AVAILABILITY AND IMPLEMENTATION: POME can be freely downloaded at http://www.stat.purdue.edu/~yuzhu/pome.html.

CONTACT: yuzhu@purdue.edu; zhaohui.qin@emory.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
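As a rough illustration of the kind of model described under RESULTS, the sketch below simulates base-level read counts whose Poisson log-rate is the sum of a transcript-level expression parameter, a base-specific effect, and a random effect that is correlated between neighboring bases. This is not the POME implementation: the abstract does not give the model's exact form, so the log link, the AR(1) random effect, and all names (`simulate_coverage`, `base_effects`, `rho`, `sigma`) are assumptions made only for this example.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method.

    Suitable for the modest rates used here; not intended for very large rates.
    """
    threshold = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_coverage(log_expression, base_effects, rho, sigma, rng):
    """Simulate read counts at each base of one hypothetical transcript.

    Assumed (illustrative) structure:
        log(rate_i) = log_expression + base_effects[i] + u_i
    where u_i is a stationary AR(1) random effect with correlation rho
    (|rho| < 1) and marginal standard deviation sigma.
    """
    innovation_sd = sigma * math.sqrt(1.0 - rho ** 2)
    u = rng.gauss(0.0, sigma)  # start from the stationary distribution
    counts = []
    for effect in base_effects:
        rate = math.exp(log_expression + effect + u)
        counts.append(sample_poisson(rate, rng))
        u = rho * u + rng.gauss(0.0, innovation_sd)  # neighbor dependence
    return counts


# Usage: a 50-base transcript with made-up base-specific effects.
rng = random.Random(0)
base_effects = [rng.gauss(0.0, 0.5) for _ in range(50)]
counts = simulate_coverage(math.log(5.0), base_effects, rho=0.8, sigma=0.4, rng=rng)
naive_estimate = sum(counts) / len(counts)  # simple averaging over bases
```

In this toy setting the simple average, `naive_estimate`, mixes the expression level with the base effects and the correlated noise, which is the problem the abstract attributes to averaging-based approaches. A model-based fit would instead estimate `log_expression` jointly with those terms.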
