Paper information - RNA-Seq gene expression estimation with read mapping uncertainty

RNA-Seq gene expression estimation with read mapping uncertainty

Motivation: RNA-Seq is a promising new technology for accurately measuring gene expression levels. Expression estimation with RNA-Seq requires the mapping of relatively short sequencing reads to a reference genome or transcript set. Because reads are generally shorter than transcripts from which they are derived, a single read may map to multiple genes and isoforms, complicating expression analyses. Previous computational methods either discard reads that map to multiple locations or allocate them to genes heuristically.

Results: We present a generative statistical model and associated inference methods that handle read mapping uncertainty in a principled manner. Through simulations parameterized by real RNA-Seq data, we show that our method is more accurate than previous methods. Our improved accuracy is the result of handling read mapping uncertainty with a statistical model and the estimation of gene expression levels as the sum of isoform expression levels. Unlike previous methods, our method is capable of modeling non-uniform read distributions. Simulations with our method indicate that a read length of 20–25 bases is optimal for gene-level expression estimation from mouse and maize RNA-Seq data when sequencing throughput is fixed.

Availability: An initial C++ implementation of our method that was used for the results presented in this article is available at http://deweylab.biostat.wisc.edu/rsem.

Contact: cdewey@biostat.wisc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
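The idea of allocating multi-mapped reads with a statistical model, and then reporting gene expression as the sum of isoform expression, can be illustrated with a toy sketch. This is not the authors' C++ implementation. It assumes a much simpler model than the one the abstract describes: read start positions are uniform, there are no sequencing errors, and every read is given as the list of isoforms it is compatible with. It uses expectation-maximization as one common way to fit such a model. All names (`em_isoform_fractions`, `gene_fractions`, the isoform labels and lengths) are hypothetical.

```python
from collections import defaultdict


def em_isoform_fractions(read_compat, lengths, n_iter=200):
    """Estimate the fraction of reads originating from each isoform.

    read_compat: list of lists; entry n holds the isoform ids read n maps to.
    lengths:     dict mapping isoform id -> length (assumed effective length).
    Simplifying assumption: P(read | isoform) = 1 / length when compatible.
    """
    isoforms = sorted(lengths)
    reads = [compat for compat in read_compat if compat]  # skip unmapped reads
    theta = {i: 1.0 / len(isoforms) for i in isoforms}    # uniform start

    for _ in range(n_iter):
        # E-step: split each read across its compatible isoforms in
        # proportion to theta[i] / lengths[i].
        expected = {i: 0.0 for i in isoforms}
        for compat in reads:
            weights = {i: theta[i] / lengths[i] for i in compat}
            total = sum(weights.values())
            for i, w in weights.items():
                expected[i] += w / total
        # M-step: new fractions are the normalized expected read counts.
        theta = {i: expected[i] / len(reads) for i in isoforms}
    return theta


def gene_fractions(theta, isoform_to_gene):
    """Gene-level estimate as the sum of its isoform-level estimates."""
    genes = defaultdict(float)
    for isoform, fraction in theta.items():
        genes[isoform_to_gene[isoform]] += fraction
    return dict(genes)


# Toy data: gene A has isoforms A1 and A2, gene B has isoform B1.
lengths = {"A1": 1000, "A2": 1000, "B1": 1000}
isoform_to_gene = {"A1": "A", "A2": "A", "B1": "B"}
read_compat = [["A1"]] * 6 + [["A1", "A2"]] * 2 + [["B1"]] * 2

theta = em_isoform_fractions(read_compat, lengths)
genes = gene_fractions(theta, isoform_to_gene)
```

In this toy data the two reads shared by A1 and A2 are neither discarded nor split by a fixed rule. Their allocation is driven by the evidence from uniquely mapped reads, and the gene-level value for A stays stable even while the split between its isoforms is still being refined.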
