IQSeq: Integrated Isoform Quantification Analysis Based on Next-Generation Sequencing

With recent advances in high-throughput RNA sequencing (RNA-Seq), biologists can measure transcription with unprecedented precision. One problem that can now be tackled is isoform quantification: reconstructing the abundances of the individual isoforms of a gene. We have developed a statistical solution to this problem, based on analyzing a set of RNA-Seq reads, together with a practical implementation in a tool we call IQSeq (Isoform Quantification in next-generation Sequencing), available from archive.gersteinlab.org/proj/rnaseq/IQSeq. Here, we present the theoretical results on which IQSeq is based, and then use both simulated and real datasets to illustrate various applications of the tool. To measure the accuracy of an isoform-quantification result, one can estimate the average variance of the estimated isoform abundances for each gene, for example by resampling the RNA-Seq reads. IQSeq instead calculates this quantity with a particularly fast algorithm based on the Fisher Information Matrix, achieving a substantial speedup over brute-force resampling. IQSeq also calculates an information-theoretic measure of overall transcriptome complexity to describe isoform abundance for a whole experiment. Many of IQSeq's features are particularly useful in RNA-Seq experimental design, allowing one to model the integration of different sequencing technologies in a cost-effective way. In particular, the IQSeq formalism integrates the analysis of different sample (i.e. read) sets generated by different technologies within the same statistical framework. It also supports a generalized statistical partial-sample-generation function to model the sequencing process. This allows a modular, “plugin-able” read-generation function that can accommodate the particularities of the many evolving sequencing technologies.
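The idea of replacing resampling with the Fisher Information Matrix can be illustrated with a minimal sketch. It assumes a simple Poisson read-count model (counts in exon segment j are Poisson with mean given by row j of a design matrix times the abundance vector), which is a common simplification and not necessarily the partial-sample-generation model IQSeq itself uses. The names `fisher_information`, `asymptotic_covariance`, `A`, and `theta` are hypothetical and chosen only for this illustration. Under that assumption, the inverse of the Fisher Information Matrix approximates the covariance of the maximum-likelihood abundance estimates, so the average variance can be read off without any resampling.

```python
import numpy as np


def fisher_information(A, theta):
    """Fisher information for counts n_j ~ Poisson((A @ theta)_j).

    A     : (segments x isoforms) matrix; entry (j, i) is the expected number
            of reads in segment j per unit abundance of isoform i (assumed).
    theta : isoform abundances at which the information is evaluated.
    """
    A = np.asarray(A, dtype=float)
    theta = np.asarray(theta, dtype=float)
    lam = A @ theta  # expected read count per segment
    if np.any(lam <= 0):
        raise ValueError("every segment needs a positive expected count")
    # I(theta) = A^T diag(1 / lam) A for independent Poisson counts
    return A.T @ (A / lam[:, None])


def asymptotic_covariance(A, theta):
    """Inverse Fisher information: large-sample covariance of the MLE."""
    return np.linalg.inv(fisher_information(A, theta))


# Toy gene (illustrative numbers only): three exon segments, two isoforms.
# Isoform 1 contains all segments; isoform 2 skips the middle one.
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]]) * 100.0
theta = np.array([2.0, 3.0])

info = fisher_information(A, theta)
cov = asymptotic_covariance(A, theta)
avg_var = np.trace(cov) / len(theta)  # average variance over the isoforms
print(info)
print(avg_var)
```

The cost here is one small matrix inversion per gene, whereas resampling requires re-estimating the abundances many times; that difference is the intuition behind the speedup, under the stated Poisson assumption.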
