SASeq: A Selective and Adaptive Shrinkage Approach to Detect and Quantify Active Transcripts using RNA-Seq

Identification and quantification of condition-specific transcripts using RNA-Seq is vital in transcriptomics research. While initial efforts using mathematical or statistical modeling of read counts or per-base exonic signal have been successful, they may suffer from model overfitting, since not all the reference transcripts in a database are expressed under a specific biological condition. Standard shrinkage approaches, such as Lasso, shrink all the transcript abundances toward zero in a non-discriminative manner, and thus do not necessarily yield the set of condition-specific transcripts. Informed shrinkage approaches that use the observed exonic coverage signal are therefore desirable. Motivated by the ubiquitous uncovered exonic regions in RNA-Seq data, termed "naked exons", we propose a new computational approach that first filters out the reference transcripts not supported by splicing and paired-end reads, and then fits a new mathematical model of the per-base exonic coverage signal and the underlying transcript structure. We introduce a tuning parameter to penalize the specific regions of the selected transcripts that are not supported by the naked exons. Our approach compares favorably with the selected competing methods in terms of both time complexity and accuracy on simulated and real-world data. Our method is implemented in SAMMate, a GUI software suite freely available from this http URL
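
The contrast between uniform and informed shrinkage can be illustrated with a small sketch. This is not the SASeq model itself, whose exact form is not given in the abstract. It is a minimal non-negative, weighted L1-penalized least-squares fit on a toy gene, under the following assumptions: the function names `naked_fraction` and `fit`, the 0/1 structure matrix `A`, the choice of per-transcript weight (the fraction of a transcript's bases with zero coverage), and the tuning value `lam` are all hypothetical and chosen only for illustration.

```python
def naked_fraction(column, coverage):
    """Fraction of a transcript's bases that have zero observed coverage.

    Hypothetical weight choice for illustration, not the paper's definition.
    """
    bases = [i for i, a in enumerate(column) if a]
    naked = sum(1 for i in bases if coverage[i] == 0)
    return naked / len(bases)


def fit(A, y, weights, lam, steps=5000):
    """Projected gradient descent for
    min_x 0.5 * ||y - A x||^2 + lam * sum_j weights[j] * x[j], subject to x >= 0.
    """
    n_pos, n_tx = len(A), len(A[0])
    # 1 / ||A||_F^2 never exceeds 1 / (largest eigenvalue of A^T A): a safe step size.
    lr = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * n_tx
    for _ in range(steps):
        resid = [sum(A[i][j] * x[j] for j in range(n_tx)) - y[i]
                 for i in range(n_pos)]
        for j in range(n_tx):
            grad = sum(A[i][j] * resid[i] for i in range(n_pos)) + lam * weights[j]
            x[j] = max(0.0, x[j] - lr * grad)  # project onto x >= 0
    return x


# Toy gene: three exons of two bases each; rows are bases, columns are transcripts.
# T1 = exons 1+2+3, T2 = exons 1+3, T3 = exons 1+2.
A = [[1, 1, 1], [1, 1, 1],   # exon 1
     [1, 0, 1], [1, 0, 1],   # exon 2
     [1, 1, 0], [1, 1, 0]]   # exon 3
y = [12, 12, 0, 0, 10, 10]   # per-base coverage; exon 2 is uncovered ("naked")

weights = [naked_fraction([row[j] for row in A], y) for j in range(3)]
plain = fit(A, y, [0.0, 0.0, 0.0], lam=0.0)   # no shrinkage
informed = fit(A, y, weights, lam=10.0)       # penalty only where coverage is missing
print("weights :", [round(w, 3) for w in weights])
print("plain   :", [round(v, 3) for v in plain])
print("informed:", [round(v, 3) for v in informed])
```

In this toy case the unpenalized fit assigns a small positive abundance to T3 to absorb the coverage mismatch on exon 1, even though T3 spans the uncovered exon. The informed penalty leaves T2 unpenalized, because all of its bases are covered, and pushes T1 and T3 to zero. A uniform Lasso penalty would shrink T2 as well.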
