Improved transcript isoform discovery using ORF graphs

MOTIVATION High-throughput sequencing of RNA in vivo facilitates many applications, not the least of which is the cataloging of variant splice isoforms of protein-coding messenger RNAs. Although many solutions have been proposed for reconstructing putative isoforms from deep sequencing data, these generally take as their substrate the collective alignment structure of RNA-seq reads and ignore the biological signals present in the actual nucleotide sequence. The majority of these solutions are graph-theoretic, relying on a splice graph representing the splicing patterns and exon expression levels indicated by the spliced-alignment process. RESULTS We show how to augment splice graphs with additional information reflecting the biology of transcription, splicing and translation, to produce what we call an ORF (open reading frame) graph. We then show how ORF graphs can be used to produce isoform predictions with higher accuracy than current state-of-the-art approaches. AVAILABILITY AND IMPLEMENTATION RSVP is available as C++ source code under an open-source licence: http://ohlerlab.mdc-berlin.de/software/RSVP/.

[1]  Tanya Z. Berardini,et al.  The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools , 2011, Nucleic Acids Res..

[2]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. , 2010, Nature biotechnology.

[3]  Tom Maniatis,et al.  A role for exon sequences and splice-site proximity in splice-site selection , 1986, Cell.

[4]  Yamile Marquez,et al.  Transcriptome survey reveals increased complexity of the alternative splicing landscape in Arabidopsis , 2012, Genome research.

[5]  M. Gouy,et al.  Codon catalog usage and the genome hypothesis. , 1980, Nucleic acids research.

[6]  Steven L Salzberg,et al.  Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[7]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[8]  P. Benfey,et al.  Integrated detection of natural antisense transcripts using strand-specific RNA sequencing data , 2013, Genome research.

[9]  Van Nostrand,et al.  Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm , 1967 .

[10]  J. Rinn,et al.  Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs , 2010, Nature Biotechnology.

[11]  M. Kozak An analysis of 5'-noncoding sequences from 699 vertebrate messenger RNAs. , 1987, Nucleic acids research.

[12]  Orion J. Buske,et al.  iReckon: Simultaneous isoform discovery and abundance estimation from RNA-seq data , 2013, Genome research.

[13]  Pramod Kaushik Mudrakarta,et al.  mTim: Rapid and accurate transcript reconstruction from RNA-Seq data , 2013, 1309.5211.

[14]  Cole Trapnell,et al.  TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions , 2013, Genome Biology.

[15]  Steven Salzberg,et al.  Efficient decoding algorithms for generalized hidden Markov model gene finders , 2005, BMC Bioinformatics.

[16]  M. Kozak The scanning model for translation: an update , 1989, The Journal of cell biology.

[17]  N. Chua,et al.  Genome-Wide Analysis Uncovers Regulation of Long Intergenic Noncoding RNAs in Arabidopsis[C][W] , 2012, Plant Cell.

[18]  Steven Salzberg,et al.  TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders , 2004, Bioinform..

[19]  G. C. Roberts,et al.  Role of an inhibitory pyrimidine element and polypyrimidine tract binding protein in repression of a regulated alpha-tropomyosin exon. , 1969, RNA.

[20]  William H. Majoros,et al.  Methods for computational gene prediction , 2007 .

[21]  J. Rinn,et al.  Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs , 2010, Nature biotechnology.

[22]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[23]  Christian G Elowsky,et al.  Participation of Leaky Ribosome Scanning in Protein Dual Targeting by Alternative Translation Initiation in Higher Plants[W][OA] , 2009, The Plant Cell Online.

[24]  Jonathan E. Allen,et al.  JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions , 2006, Genome Biology.