Fleximer: Accurate Quantification of RNA-Seq via Variable-Length k-mers

The advent of RNA-Seq has made it possible to quantify the expression of many transcripts simultaneously and at large scale. The technology generates short fragments of each transcript sequence, known as sequencing reads. As the first step of data analysis towards expression quantification, most existing methods align these reads to a reference genome or transcriptome to establish their origins. However, read alignment is computationally costly. Recently, a series of methods have been proposed to perform lightweight quantification in an alignment-free manner. These methods use k-mers, short contiguous subsequences that serve as signatures of each transcript, to estimate relative abundance from RNA-Seq reads. Current k-mer based approaches rely on a set of fixed-size k-mers; however, the true signatures of a transcript do not necessarily have a single fixed length. In this paper, we demonstrate the importance of k-mer selection in transcript abundance estimation. We propose a novel method, Fleximer, to efficiently discover and select an optimal set of k-mers with flexible lengths. Using both simulated and real datasets, we show that, with fewer k-mers, Fleximer covers a similar number of reads as Sailfish and Kallisto. The selected k-mers are more distinctive and thus substantially reduce errors in transcript abundance estimation.
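
To make the fixed-size limitation concrete, the following is a minimal brute-force sketch in Python. It is not Fleximer's algorithm, and nothing in it is taken from the Fleximer implementation: the function names, the `TOY` transcripts, and the exhaustive search are all assumptions chosen for illustration. It only shows that the shortest substring that uniquely identifies a transcript can differ from one transcript to another, so a single fixed k can leave some transcripts without any signature.

```python
from collections import defaultdict

# Hypothetical toy transcriptome (illustrative sequences, not real data).
TOY = {"t1": "ACGT", "t2": "CGTA", "t3": "TACG"}


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(transcripts, k):
    """For a fixed k, map each transcript to the k-mers found in no other transcript."""
    owners = defaultdict(set)  # k-mer -> names of transcripts containing it
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(name)
    result = {name: set() for name in transcripts}
    for kmer, names in owners.items():
        if len(names) == 1:  # k-mer occurs in exactly one transcript
            result[next(iter(names))].add(kmer)
    return result


def shortest_signature_length(transcripts, name, max_k):
    """Smallest k <= max_k at which `name` has a unique k-mer, or None if there is none."""
    for k in range(1, max_k + 1):
        if unique_kmers(transcripts, k)[name]:
            return k
    return None


if __name__ == "__main__":
    print(unique_kmers(TOY, 3))
    for name in TOY:
        print(name, shortest_signature_length(TOY, name, max_k=4))
```

In this toy input, fixing k = 3 gives `t2` and `t3` one unique 3-mer each but leaves `t1` with none; `t1` only becomes distinguishable at k = 4. A practical method would need an index structure rather than this exhaustive enumeration, which is used here purely for clarity.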
