A Novel Combinatorial Method for Estimating Transcript Expression with RNA-Seq: Bounding the Number of Paths

RNA-Seq technology offers new high-throughput ways to identify and quantify transcripts from short reads, and has recently attracted great interest. The problem is usually modeled by a weighted splicing graph whose nodes stand for exons and whose edges stand for split alignments to the exons. The task consists of finding a number of paths, together with their expression levels, which optimally explain the coverages of the graph under various fitness functions, such as least sum of squares. In (Tomescu et al. RECOMB-seq 2013) we showed that under general fitness functions, if we allow a polynomially bounded number of paths in an optimal solution, this problem can be solved in polynomial time by a reduction to a min-cost flow program. In this paper we further refine this problem by asking for a bounded number k of paths that optimally explain the splicing graph. This problem becomes NP-hard in the strong sense, but we give a fast combinatorial algorithm based on dynamic programming for it. In order to obtain a practical tool, we implement three optimizations and heuristics, which achieve better performance on real data, and similar or better performance on simulated data, than the state-of-the-art tools Cufflinks, IsoLasso and SLIDE. Our tool, called Traph, is available at http://www.cs.helsinki.fi/gsa/traph/ .
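
To make the problem statement concrete, the following is a minimal sketch of the k-path objective under a least-sum-of-squares fitness function. It is not the dynamic programming algorithm of the paper and not Traph's implementation: it simply tries every choice of k source-to-sink paths and every integer expression level up to a cap, which is feasible only on toy graphs. The function names, the integer cap `max_level`, and the choice to score both node and edge coverages are assumptions made for illustration.

```python
from itertools import combinations, product

def enumerate_paths(adj, sources, sinks):
    """List every source-to-sink path of a DAG given as {node: [successors]}."""
    paths = []

    def extend(path):
        node = path[-1]
        if node in sinks:
            paths.append(tuple(path))
        for nxt in adj.get(node, []):
            extend(path + [nxt])

    for s in sources:
        extend([s])
    return paths

def squared_error(paths, levels, node_cov, edge_cov):
    """Sum of squared differences between observed and explained coverage.

    Assumption: every edge used by a path has an entry in edge_cov.
    """
    explained_nodes = {v: 0 for v in node_cov}
    explained_edges = {e: 0 for e in edge_cov}
    for path, level in zip(paths, levels):
        for v in path:
            explained_nodes[v] += level
        for e in zip(path, path[1:]):
            explained_edges[e] += level
    err = sum((node_cov[v] - explained_nodes[v]) ** 2 for v in node_cov)
    err += sum((edge_cov[e] - explained_edges[e]) ** 2 for e in edge_cov)
    return err

def best_k_paths(adj, sources, sinks, node_cov, edge_cov, k, max_level):
    """Exhaustive search (illustration only): exactly k distinct paths,
    each with an integer expression level in 1..max_level."""
    candidates = enumerate_paths(adj, sources, sinks)
    best = None
    for chosen in combinations(candidates, k):
        for levels in product(range(1, max_level + 1), repeat=k):
            err = squared_error(chosen, levels, node_cov, edge_cov)
            if best is None or err < best[0]:
                best = (err, chosen, levels)
    return best

# Toy splicing graph: exon "a" splices to "b" or "c", both of which join "d".
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
node_cov = {"a": 8, "b": 3, "c": 5, "d": 8}
edge_cov = {("a", "b"): 3, ("a", "c"): 5, ("b", "d"): 3, ("c", "d"): 5}

print(best_k_paths(adj, {"a"}, {"d"}, node_cov, edge_cov, k=2, max_level=8))
print(best_k_paths(adj, {"a"}, {"d"}, node_cov, edge_cov, k=1, max_level=8))
```

On this toy input, two paths explain the coverages exactly, while forcing k = 1 leaves a nonzero residual. The exhaustive search grows exponentially with the number of candidate paths, which is consistent with the hardness noted above and is why a dedicated algorithm is needed in practice.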

[1]  Yi Xing,et al.  The multiassembly problem: reconstructing multiple transcript isoforms from EST fragment mixtures , 2004, Genome research.

[2]  James B. Brown,et al.  Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation , 2011, Proceedings of the National Academy of Sciences.

[3]  Fatih Ozsolak,et al.  RNA sequencing: advances, challenges and opportunities , 2011, Nature Reviews Genetics.

[4]  Haixu Tang,et al.  Splicing graphs and EST assembly problem , 2002, ISMB.

[5]  J. Rinn,et al.  Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs , 2010, Nature Biotechnology.

[6]  Orion J. Buske,et al.  iReckon: Simultaneous isoform discovery and abundance estimation from RNA-seq data , 2013, Genome research.

[7]  Xuegong Zhang,et al.  Isoform reconstruction using short RNA-Seq reads by maximum likelihood is NP-hard , 2013, 1305.0916.

[8]  Faraz Hach,et al.  CLIIQ: Accurate Comparative Detection and Quantification of Expressed Isoforms in a Population , 2012, WABI.

[9]  Xiaobo Zhou,et al.  NSMAP: A method for spliced isoforms identification and quantification from RNA-Seq , 2011, BMC Bioinformatics.

[10]  Alexandru I. Tomescu,et al.  A novel min-cost flow method for estimating transcript expression with RNA-Seq , 2013, BMC Bioinformatics.

[11]  Tao Jiang,et al.  Inference of Isoforms from Short Sequence Reads , 2010, RECOMB.

[12]  Eduardo Eyras,et al.  Methods to Study Splicing from RNA-Seq , 2013 .

[13]  Peter G. M. van der Heijden,et al.  Estimating the Size of a Criminal Population from Police Records Using the Truncated Poisson Regression Model , 2003 .

[14]  Bosiljka Tasic,et al.  Alternative pre-mRNA splicing and proteome expansion in metazoans , 2002, Nature.

[15]  J. Rinn,et al.  Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs , 2010, Nature biotechnology.

[16]  A. Mortazavi,et al.  Computation for ChIP-seq and RNA-seq studies , 2009, Nature Methods.

[17]  Julien Mairal,et al.  Efficient RNA isoform identification and quantification from RNA-Seq data with network flows , 2014, Bioinform..

[18]  B. Williams,et al.  Mapping and quantifying mammalian transcriptomes by RNA-Seq , 2008, Nature Methods.

[19]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation , 2010, Nature biotechnology.

[20]  Philippe Chrétienne,et al.  Simple bounds and greedy algorithms for decomposing a flow into a minimal set of paths , 2008, Eur. J. Oper. Res..

[21]  Tao Jiang,et al.  IsoLasso: A LASSO Regression Approach to RNA-Seq Based Transcriptome Assembly - (Extended Abstract) , 2011, RECOMB.

[22]  Eugene W. Myers.  A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming , 1998, CPM.

[23]  Kenneth K. Lopiano,et al.  RNA-seq: technical variability and sampling , 2011, BMC Genomics.

[24]  Lior Pachter,et al.  Sequence Analysis , 2020, Definitions.

[25]  Wing Hung Wong,et al.  Simultaneous Isoform Discovery and Quantification from RNA-Seq , 2013, Statistics in biosciences.

[26]  P. Bork,et al.  Alternative splicing and genome complexity , 2002, Nature Genetics.

[27]  Dumitru Brinza,et al.  An integer programming approach to novel transcript reconstruction from paired-end RNA-Seq reads , 2012, BCB.