Exact Transcriptome Reconstruction from Short Sequence Reads

In this paper we address the problem of characterizing the RNA complement of a given cell type, that is, the set of RNA species and their relative copy numbers, from a large set of short sequence reads randomly sampled from the cell's RNA sequences in a sequencing experiment. We refer to this as the transcriptome reconstruction problem, and we investigate, both theoretically and practically, the conditions under which it can be solved. We demonstrate that, even under the assumption of exact information, neither single-read nor paired-end read sequences theoretically guarantee that the reconstruction problem has a unique solution. However, by investigating the behavior of the best-annotated human gene set, we also show that, in practice, paired-end reads (but not single reads) may be sufficient to resolve the species and abundances of the vast majority of transcript variants. Finally, we show that, when the RNA species existing in the cell are assumed to be known, single-read sequences can effectively be used to infer transcript variant abundances.
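The non-uniqueness claim for single reads can be illustrated with a toy case. The sketch below is not the construction used in the paper; it is a minimal example under stated assumptions: a hypothetical gene has two independent alternative regions (exon A or B, then exon C or D) separated by a constitutive exon M that is longer than a read, so no single read covers both regions. Exact single-read information is then summarized as the abundance of each exon and each exon-exon junction. All exon names and helper functions are hypothetical.

```python
from collections import Counter

def local_features(transcript):
    """Return the exons and exon-exon junctions of a transcript.

    A transcript is a tuple of exon names (hypothetical labels). These are
    the only features a read shorter than exon M can report on.
    """
    exons = [("exon", e) for e in transcript]
    junctions = [("junction", a, b) for a, b in zip(transcript, transcript[1:])]
    return exons + junctions

def expected_signal(mixture):
    """Sum per-feature abundances over a mixture {transcript: copy number}."""
    signal = Counter()
    for transcript, copies in mixture.items():
        for feature in local_features(transcript):
            signal[feature] += copies
    return signal

# Two different transcriptomes built from the same exons (assumed toy data).
mixture_1 = {("A", "M", "C"): 1, ("B", "M", "D"): 1}
mixture_2 = {("A", "M", "D"): 1, ("B", "M", "C"): 1}

# Both mixtures yield identical exon and junction abundances, so exact
# single-read information cannot tell them apart.
same_signal = expected_signal(mixture_1) == expected_signal(mixture_2)
print(same_signal)
```

In this toy case, a read pair with one mate in the A/B region and the other in the C/D region would separate the two mixtures, and fixing the set of expressed species in advance would also remove the ambiguity. This is in line with the practical findings stated above.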
