CLASS: Accurate and Efficient Splice Variant Annotation from RNA-seq Reads

Next-generation sequencing of cellular RNA is making it possible to characterize genes and alternative splicing in unprecedented detail. However, designing bioinformatics tools that capture splicing variation accurately has proven difficult. Current programs either find the major isoforms of a gene but miss finer splicing variations, or are sensitive but highly imprecise. We present CLASS, a novel open-source tool for accurate genome-guided transcriptome assembly from RNA-seq reads. CLASS represents a gene and its splice variants as a splice graph, uses a linear program to determine an accurate set of exons, and applies efficient splice-graph-based algorithms to select transcripts. Compared against reference programs, CLASS had the best overall accuracy and detected up to twice as many splicing events, with precision similar to that of the best reference program. Notably, it was the only tool that produced consistently reliable transcript models across a wide range of applications and sequencing strategies, including very large data sets and ribosomal RNA-depleted samples. Lightweight and multi-threaded, CLASS required less than 3 GB of RAM and less than one day to analyze a set of 350 million reads. It is an excellent choice for transcriptomics studies, from clinical RNA sequencing to alternative splicing analyses and the annotation of new genomes.
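To make the splice graph idea concrete, the following is a minimal sketch and not CLASS's actual implementation. It assumes exons are given as (start, end) intervals and introns as pairs of exon indices; the function names `build_splice_graph` and `enumerate_transcripts` and the toy coordinates are hypothetical. Each path from a first exon to a last exon is a candidate transcript. The linear program for exon detection and the transcript selection step described in the abstract are not modeled here.

```python
from collections import defaultdict


def build_splice_graph(introns):
    """Build adjacency lists: exon index -> exon indices reachable by one intron.

    `introns` is a list of (donor_exon_index, acceptor_exon_index) pairs.
    This input format is an assumption made for illustration.
    """
    graph = defaultdict(list)
    for src, dst in introns:
        graph[src].append(dst)
    return graph


def enumerate_transcripts(exons, graph):
    """List every source-to-sink path as a candidate transcript (list of exons)."""
    has_incoming = {dst for targets in graph.values() for dst in targets}
    sources = [i for i in range(len(exons)) if i not in has_incoming]
    transcripts = []

    def walk(node, path):
        path = path + [node]
        successors = graph.get(node)
        if not successors:
            # No outgoing intron: this exon ends the transcript.
            transcripts.append([exons[i] for i in path])
            return
        for nxt in sorted(successors):
            walk(nxt, path)

    for source in sources:
        walk(source, [])
    return transcripts


# Toy gene with three exons; the intron (0, 2) skips the middle exon.
exons = [(100, 200), (300, 400), (500, 600)]
introns = [(0, 1), (1, 2), (0, 2)]

graph = build_splice_graph(introns)
candidates = enumerate_transcripts(exons, graph)
for transcript in candidates:
    print(transcript)
```

In this toy example the graph yields two candidates: the full three-exon transcript and an exon-skipping variant. Exhaustive enumeration grows quickly with the number of alternative events, which is one reason an assembler needs a selection step rather than reporting every path.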
