Terminus enables the discovery of data-driven, robust transcript groups from RNA-seq data

Motivation: Advances in sequencing technology, inference algorithms and differential testing methodology have enabled transcript-level analysis of RNA-seq data. Yet the inherent inferential uncertainty in transcript-level abundance estimation, even among the most accurate approaches, means that robust transcript-level analysis often remains a challenge. Conversely, gene-level analysis remains a common and robust approach for understanding RNA-seq data, but it coarsens the resulting analysis to the level of genes, even if the data strongly support specific transcript-level effects.

Results: We introduce a new data-driven approach for grouping together transcripts in an experiment based on their inferential uncertainty. Transcripts that share large numbers of ambiguously mapping fragments with other transcripts, in complex patterns, often cannot have their abundances confidently estimated. Yet the total transcriptional output of such a group of transcripts will have greatly reduced inferential uncertainty, allowing more robust and confident downstream analysis. Our approach, implemented in the tool terminus, groups transcripts in a data-driven manner, allowing transcript-level analysis where it can be confidently supported and deriving transcriptional groups where the inferential uncertainty is too high to support a transcript-level result.

Availability: Terminus is implemented in Rust, and is freely available and open source. It can be obtained from https://github.com/COMBINE-lab/Terminus.

Contact: rob@cs.umd.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
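The claim that a group's total abundance can be far more certain than the abundances of its individual members can be illustrated with a toy calculation. The sketch below is not Terminus's algorithm or its grouping criterion: the replicate values are made up, and `relative_variance` (variance divided by mean) is a hypothetical stand-in for whatever uncertainty measure a real tool would use. It assumes two transcripts that share most of their fragments, so that each inferential replicate splits roughly the same 100 fragments between them in a different way.

```python
from statistics import mean, pvariance


def relative_variance(samples):
    """Variance of replicate estimates divided by their mean.

    Illustrative uncertainty measure only; not the metric used by Terminus.
    """
    m = mean(samples)
    return pvariance(samples) / m if m > 0 else 0.0


# Hypothetical inferential replicates (e.g. posterior or bootstrap draws)
# for two transcripts that share most of their fragments. Each replicate
# assigns the ambiguous fragments differently, so the two estimates are
# strongly anti-correlated.
tx_a = [10.0, 90.0, 35.0, 60.0, 50.0, 20.0, 80.0, 45.0]
tx_b = [91.0, 9.0, 66.0, 41.0, 49.0, 79.0, 21.0, 56.0]

# Abundance of the group in each replicate is the sum of its members.
group = [a + b for a, b in zip(tx_a, tx_b)]

rv_a = relative_variance(tx_a)
rv_b = relative_variance(tx_b)
rv_group = relative_variance(group)

print(f"transcript A: {rv_a:.3f}")
print(f"transcript B: {rv_b:.3f}")
print(f"group A+B:    {rv_group:.3f}")
```

Because the replicates disagree about how to split the shared fragments but agree on their total, the per-transcript estimates swing widely while the group total stays close to 100 in every replicate. This is the situation in which reporting the group, rather than either transcript alone, gives a more defensible downstream result.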
