Modeling non-uniformity in short-read rates in RNA-Seq data

After mapping, RNA-Seq data can be summarized by a sequence of read counts, which are commonly modeled as Poisson variables with a constant rate along each transcript. This constant-rate model fits the data poorly. We suggest letting the rate vary by position, and propose two models that predict these rates from the local sequence. These models explain more than 50% of the variation and can lead to improved estimates of gene and isoform expression for both Illumina and Applied Biosystems data.
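To make the contrast between constant and position-specific rates concrete, the following is a minimal Python sketch, not one of the two models proposed here. It assumes a deliberately simplified stand-in in which the Poisson rate at each position depends only on the short sequence context centered on that position, so the maximum-likelihood rate for a context is simply the mean count over positions sharing it. The function names, the `flank` parameter, the toy sequence and counts, and the use of the fraction of Poisson deviance explained as the measure of fit are all illustrative assumptions.

```python
import math
from collections import defaultdict


def poisson_deviance(counts, rates):
    """Poisson deviance 2 * sum(y * log(y / mu) - (y - mu)), with 0 * log(0) = 0."""
    total = 0.0
    for y, mu in zip(counts, rates):
        term = mu - y
        if y > 0:
            term += y * math.log(y / mu)
        total += term
    return 2.0 * total


def constant_rates(counts):
    """Constant-rate model: every position gets the mean count (the Poisson MLE)."""
    mean = sum(counts) / len(counts)
    return [mean] * len(counts)


def local_context(seq, j, flank):
    """Sequence window of width 2 * flank + 1 centered on position j, padded with N."""
    padded = "N" * flank + seq + "N" * flank
    return padded[j:j + 2 * flank + 1]


def context_rates(seq, counts, flank=1):
    """Variable-rate model (simplified assumption): one rate per local context.

    The MLE of each context's rate is the mean count over positions with that context.
    """
    sums = defaultdict(float)
    sizes = defaultdict(int)
    contexts = [local_context(seq, j, flank) for j in range(len(seq))]
    for ctx, y in zip(contexts, counts):
        sums[ctx] += y
        sizes[ctx] += 1
    return [sums[ctx] / sizes[ctx] for ctx in contexts]


def fraction_explained(counts, rates):
    """1 - D(model) / D(constant): share of deviance explained beyond a constant rate."""
    null_deviance = poisson_deviance(counts, constant_rates(counts))
    if null_deviance == 0.0:
        return 0.0  # counts are already constant; nothing left to explain
    return 1.0 - poisson_deviance(counts, rates) / null_deviance


# Toy data (made up for illustration): read-start counts along a short transcript.
seq = "ACGTACGTACGT"
counts = [2, 9, 4, 1, 3, 8, 5, 1, 2, 10, 4, 0]

rates = context_rates(seq, counts, flank=1)
print("fraction of deviance explained:", round(fraction_explained(counts, rates), 3))
```

Because this sketch fits and evaluates on the same positions, it overstates the fit that would be obtained on held-out data; a more realistic model would share parameters across contexts (for example through a log-linear form) and be validated on separate transcripts.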
