LongTron: Automated Analysis of Long Read Spliced Alignment Accuracy

Motivation: Long-read sequencing has increased the accuracy and completeness of assemblies of various organisms' genomes in recent months. Similarly, spliced alignments of long-read RNA sequencing hold the promise of delivering much longer transcripts of existing and novel isoforms in known genes without the need for error-prone transcript assemblies from short reads. However, low coverage and high error rates potentially hamper the widespread adoption of long-read spliced alignments in annotation updates and isoform-level expression quantification.

Results: Addressing these issues, we first develop a simulation of error modes for both Oxford Nanopore and PacBio CCS spliced alignments. Based on this simulation, we train a Random Forest classifier to assign new long-read alignments to one of two error categories or a novel category, or to label them as non-error. We use this classifier to label reads from the spliced alignments of the popular aligner minimap2, run on three long-read sequencing datasets: NA12878 from Oxford Nanopore, NA12878 from PacBio CCS, and a PacBio SKBR3 cancer cell line dataset. Finally, we compare the intron chains of the three long-read alignments against individual splice sites, short-read assemblies, and the output from the FLAIR pipeline on the same samples. Our results demonstrate a substantial lack of precision in determining exact splice sites for long reads during alignment on both platforms, while showing some benefit from postprocessing. This work motivates the need for both better aligners and additional post-alignment processing to adjust incorrectly called putative splice sites and to clarify support for novel transcripts.

Availability and implementation: Source code for the random forest, implemented in Python, is available at https://github.com/schatzlab/LongTron under the MIT license. The modified version of GffCompare used to construct Table 3 and related results is available at https://github.com/ChristopherWilks/gffcompare/releases/tag/0.11.2LT

Supplementary information: Supplementary notes and figures are available online.
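As an illustration of the intron-chain comparison described in the Results, the following is a minimal sketch, not the LongTron or GffCompare implementation. The function names, the half-open coordinate convention, and the `wobble` tolerance are all assumptions introduced here for the example. It derives an intron chain from a read's aligned exon blocks and reports whether the chain matches an annotated chain exactly, only within a small splice-site tolerance, or not at all.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open genomic coordinates


def intron_chain(exons: List[Interval]) -> List[Interval]:
    """Return the introns implied by a set of aligned exon blocks."""
    ordered = sorted(exons)
    # Each intron spans from the end of one exon to the start of the next.
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def compare_chains(read_chain: List[Interval],
                   annotated_chain: List[Interval],
                   wobble: int = 10) -> str:
    """Classify a read's intron chain against an annotated chain.

    `wobble` is a hypothetical tolerance (in bases) for imprecise splice sites.
    """
    if len(read_chain) != len(annotated_chain):
        return "different"
    if read_chain == annotated_chain:
        return "exact"
    within_tolerance = all(
        abs(r_start - a_start) <= wobble and abs(r_end - a_end) <= wobble
        for (r_start, r_end), (a_start, a_end) in zip(read_chain, annotated_chain)
    )
    return "near" if within_tolerance else "different"


# Hypothetical coordinates, for illustration only.
annotated = intron_chain([(100, 200), (300, 400), (500, 600)])
long_read = intron_chain([(100, 203), (300, 400), (498, 600)])
result = compare_chains(long_read, annotated, wobble=10)
```

Here the read's splice sites are off by a few bases, so the chain is reported as "near" rather than "exact". Separating these two outcomes is one simple way to expose the kind of splice-site imprecision the abstract describes.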
