Fast and interpretable alternative splicing and differential gene-level expression analysis using transcriptome segmentation with Yanagi

Introduction: Analysis of differential alternative splicing from RNA-seq data is complicated by the fact that many RNA-seq reads map to multiple transcripts, besides, the annotated transcripts are often a small subset of the possible transcripts of a gene. Here we describe Yanagi, a tool for segmenting transcriptome to create a library of maximal L-disjoint segments from a complete transcriptome annotation. That segment library preserves all transcriptome substrings of length L and transcripts structural relationships while eliminating unnecessary sequence duplications. Contributions: In this paper, we formalize the concept of transcriptome segmentation and propose an efficient algorithm for generating segment libraries based on a length parameter dependent on specific RNA-Seq library construction. The resulting segment sequences can be used with pseudo-alignment tools to quantify expression at the segment level. We characterize the segment libraries for the reference transcriptomes of Drosophila melanogaster and Homo sapiens and provide gene-level visualization of the segments for better interpretability. Then we demonstrate the use of segments-level quantification into gene expression and alternative splicing analysis. The notion of transcript segmentation as introduced here and implemented in Yanagi opens the door for the application of lightweight, ultra-fast pseudo-alignment algorithms in a wide variety of RNA-seq analyses. Conclusion: Using segment library rather than the standard transcriptome succeeds in significantly reducing ambigious alignments where reads are multimapped to several sequences in the reference. That allowed avoiding the quantification step required by standard kmer-based pipelines for gene expression analysis. Moreover, using segment counts as statistics for alternative splicing analysis enables achieving comparable performance to counting-based approaches (e.g. rMATS) while rather using fast and lighthweight pseudo alignment.

[1]  Davis J. McCarthy,et al.  Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation , 2012, Nucleic acids research.

[2]  Colin N. Dewey,et al.  RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome , 2011, BMC Bioinformatics.

[3]  David G Hendrickson,et al.  Differential analysis of gene regulation at transcript resolution with RNA-seq , 2012, Nature Biotechnology.

[4]  Juan González-Vallinas,et al.  A new view of transcriptome complexity and regulation through the lens of local splicing variations , 2016, eLife.

[5]  Héctor Corrada Bravo,et al.  Yanagi: Transcript Segment Library Construction for RNA-Seq Quantification , 2017, WABI.

[6]  Thomas R. Gingeras,et al.  STAR: ultrafast universal RNA-seq aligner , 2013, Bioinform..

[7]  Lior Pachter,et al.  Gene-level differential analysis at transcript-level resolution , 2017, Genome Biology.

[8]  Juan L Trincado,et al.  SUPPA2 provides fast, accurate, and uncertainty-aware differential splicing analysis across multiple conditions , 2017, bioRxiv.

[9]  T. Blauwkamp,et al.  Comprehensive transcriptome analysis using synthetic long-read sequencing reveals molecular co-association of distant splicing events , 2015, Nature Biotechnology.

[10]  Robert Patro,et al.  RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes , 2015, bioRxiv.

[11]  Cole Trapnell,et al.  TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions , 2013, Genome Biology.

[12]  Geoffrey J. Barton,et al.  Identifying differential isoform abundance with RATs: a universal tool and a warning , 2017, bioRxiv.

[13]  W. Huber,et al.  Detecting differential usage of exons from RNA-seq data , 2012, Genome research.

[14]  Stephen M. Mount,et al.  Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. , 2003, Nucleic acids research.

[15]  Steven L Salzberg,et al.  Fast gapped-read alignment with Bowtie 2 , 2012, Nature Methods.

[16]  Gordon K Smyth,et al.  Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments , 2004, Statistical applications in genetics and molecular biology.

[17]  Daisuke Hattori,et al.  Got diversity? Wiring the fly brain with Dscam. , 2006, Trends in biochemical sciences.

[18]  Michael B. Black,et al.  IVT-seq reveals extreme bias in RNA sequencing , 2014, Genome Biology.

[19]  Matthias Heinig,et al.  Alternative Splicing Signatures in RNA‐seq Data: Percent Spliced in (PSI) , 2015, Current protocols in human genetics.

[20]  Mark D. Robinson,et al.  Isoform prefiltering improves performance of count-based methods for analysis of differential transcript usage , 2016, Genome Biology.

[21]  Alyssa C. Frazee,et al.  Polyester: Simulating RNA-Seq Datasets With Differential Transcript Expression , 2014, bioRxiv.

[22]  Julie A. Dickerson,et al.  Comparisons of computational methods for differential alternative splicing detection using RNA-seq in plant systems , 2014, BMC Bioinformatics.

[23]  Gerald Hysenaj,et al.  Fast and accurate differential splicing analysis across multiple conditions with replicates , 2016 .

[24]  Lior Pachter,et al.  Near-optimal probabilistic RNA-seq quantification , 2016, Nature Biotechnology.

[25]  Rob Patro,et al.  Salmon provides fast and bias-aware quantification of transcript expression , 2017, Nature Methods.

[26]  R. Irizarry,et al.  Modeling of RNA-seq fragment sequence bias reduces systematic errors in transcript abundance estimation , 2015, Nature Biotechnology.

[27]  W. Huber,et al.  Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 , 2014, Genome Biology.

[28]  Lior Pachter,et al.  Sequence Analysis , 2020, Definitions.

[29]  Lan Lin,et al.  rMATS: Robust and flexible detection of differential alternative splicing from replicate RNA-Seq data , 2014, Proceedings of the National Academy of Sciences.

[30]  Rob Patro,et al.  Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms , 2013, Nature Biotechnology.

[31]  Charity W. Law,et al.  voom: precision weights unlock linear model analysis tools for RNA-seq read counts , 2014, Genome Biology.

[32]  Derek Y. Chiang,et al.  DiffSplice: the genome-wide detection of differential splicing events with RNA-seq , 2012, Nucleic acids research.

[33]  Eduardo Eyras,et al.  SUPPA: a super-fast pipeline for alternative splicing analysis from RNA-Seq , 2014 .

[34]  Robert Patro,et al.  Sailfish: Alignment-free Isoform Quantification from RNA-seq Reads using Lightweight Algorithms , 2013, ArXiv.

[35]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. , 2010, Nature biotechnology.