IsoLasso: A LASSO Regression Approach to RNA-Seq Based Transcriptome Assembly - (Extended Abstract)

The new second generation sequencing technology revolutionizes many biology-related research fields and poses various computational biology challenges. One of them is transcriptome assembly based on RNA-Seq data, which aims at reconstructing all full-length mRNA transcripts simultaneously from millions of short reads. In this article, we consider three objectives in transcriptome assembly: the maximization of prediction accuracy, minimization of interpretation, and maximization of completeness. The first objective, the maximization of prediction accuracy, requires that the estimated expression levels based on assembled transcripts should be as close as possible to the observed ones for every expressed region of the genome. The minimization of interpretation follows the parsimony principle to seek as few transcripts in the prediction as possible. The third objective, the maximization of completeness, requires that the maximum number of mapped reads (or ?expressed segments? in gene models) be explained by (i.e., contained in) the predicted transcripts in the solution. Based on the above three objectives, we present IsoLasso, a new RNA-Seq based transcriptome assembly tool. IsoLasso is based on the well-known LASSO algorithm, a multivariate regression method designated to seek a balance between the maximization of prediction accuracy and the minimization of interpretation. By including some additional constraints in the quadratic program involved in LASSO, IsoLasso is able to make the set of assembled transcripts as complete as possible. Experiments on simulated and real RNA-Seq datasets show that IsoLasso achieves, simultaneously, higher sensitivity and precision than the state-of-art transcript assembly tools.

[1]  R. R. Hocking,et al.  Selection of the Best Subset in Regression Analysis , 1967 .

[2]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[3]  Thomas F. Coleman,et al.  Optimization Toolbox User's Guide , 1998 .

[4]  A. E. Hoerl,et al.  Ridge regression: biased estimation for nonorthogonal problems , 2000 .

[5]  R. Tibshirani,et al.  Least angle regression , 2004, math/0406456.

[6]  D. Ruppert The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2004 .

[7]  Mikael Gustafsson,et al.  Constructing and Analyzing a Large-Scale Gene-to-Gene Regulatory Network-Lasso-Constrained Inference and Biological Validation , 2005, IEEE ACM Trans. Comput. Biol. Bioinform..

[8]  M. Gustafsson,et al.  Constructing and analyzing a large-scale gene-to-gene regulatory network Lasso-constrained inference and biological validation , 2005, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[9]  H. Zou,et al.  Regularization and variable selection via the elastic net , 2005 .

[10]  Jian Huang,et al.  BMC Bioinformatics BioMed Central Methodology article Supervised group Lasso with applications to microarray data , 2007 .

[11]  William Stafford Noble,et al.  Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project , 2007, Nature.

[12]  J. Lupski,et al.  The complete genome of an individual by massively parallel DNA sequencing , 2008, Nature.

[13]  Ryan D. Morin,et al.  Profiling the HeLa S3 transcriptome using randomly primed cDNA and massively parallel short-read sequencing. , 2008, BioTechniques.

[14]  Naomi S. Altman,et al.  Comparison of next generation sequencing technologies for transcriptome characterization , 2009, BMC Genomics.

[15]  M. Stephens,et al.  RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. , 2008, Genome research.

[16]  S. Ranade,et al.  Stem cell transcriptome profiling via massive-scale mRNA sequencing , 2008, Nature Methods.

[17]  R. Lister,et al.  Highly Integrated Single-Base Resolution Maps of the Epigenome in Arabidopsis , 2008, Cell.

[18]  M. Gerstein,et al.  The Transcriptional Landscape of the Yeast Genome Defined by RNA Sequencing , 2008, Science.

[19]  I. Goodhead,et al.  Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution , 2008, Nature.

[20]  Yang Jing L1 Regularization Path Algorithm for Generalized Linear Models , 2008 .

[21]  J. Wain,et al.  High-throughput sequencing provides insights into genome variation and evolution in Salmonella Typhi , 2008, Nature Genetics.

[22]  B. Williams,et al.  Mapping and quantifying mammalian transcriptomes by RNA-Seq , 2008, Nature Methods.

[23]  Eric P. Xing,et al.  A multivariate regression approach to association analysis of a quantitative trait network , 2008, Bioinform..

[24]  Wing Hung Wong,et al.  Statistical inferences for isoform expression in RNA-Seq , 2009, Bioinform..

[25]  Lior Pachter,et al.  Sequence Analysis , 2020, Definitions.

[26]  Hunter B. Fraser,et al.  Ab initio construction of a eukaryotic transcriptome by massively parallel mRNA sequencing , 2009, Proceedings of the National Academy of Sciences.

[27]  Inanç Birol,et al.  De novo transcriptome assembly with ABySS , 2009, Bioinform..

[28]  M. Gerstein,et al.  RNA-Seq: a revolutionary tool for transcriptomics , 2009, Nature Reviews Genetics.

[29]  M. Marra,et al.  Applications of new sequencing technologies for transcriptome analysis. , 2009, Annual review of genomics and human genetics.

[30]  Trevor J. Hastie,et al.  Genome-wide association analysis by lasso penalized logistic regression , 2009, Bioinform..

[31]  W. Wong,et al.  Modeling non-uniformity in short-read rates in RNA-Seq data , 2010, Genome Biology.

[32]  Cole Trapnell,et al.  Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. , 2010, Nature biotechnology.

[33]  Marcel H. Schulz,et al.  Prediction of alternative isoforms from exon expression levels in RNA-Seq experiments , 2010, Nucleic acids research.

[34]  J. Rinn,et al.  Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs , 2010, Nature Biotechnology.

[35]  Tao Jiang,et al.  Inference of Isoforms from Short Sequence Reads , 2010, RECOMB.

[36]  Eran Halperin,et al.  Accurate Estimation of Expression Levels of Homologous Genes in RNA-seq Experiments , 2010, RECOMB.

[37]  S. Srivastava,et al.  A two-parameter generalized Poisson model to improve the analysis of RNA-seq data , 2010, Nucleic acids research.

[38]  B. Haas,et al.  Advancing RNA-Seq analysis , 2010, Nature Biotechnology.

[39]  K. Kristiansen,et al.  Corrigendum: Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs , 2010, Nature Biotechnology.

[40]  Y. Xing,et al.  Detection of splice junctions from paired-end RNA-seq data by SpliceMap , 2010, Nucleic acids research.

[41]  J. Rinn,et al.  Ab initio reconstruction of transcriptomes of pluripotent and lineage committed cells reveals gene structures of thousands of lincRNAs , 2010, Nature biotechnology.

[42]  Sanghyuk Lee,et al.  Accurate quantification of transcriptome from RNA-Seq data by effective length normalization , 2010, Nucleic Acids Res..

[43]  Tao Jiang,et al.  Inference of Isoforms from Short Sequence Reads , 2010, RECOMB.

[44]  M. Borodovsky,et al.  TrueSight: a new algorithm for splice junction detection using RNA-seq , 2012, Nucleic acids research.

[45]  Wing Hung Wong,et al.  Simultaneous Isoform Discovery and Quantification from RNA-Seq , 2013, Statistics in biosciences.

[46]  C. Donati,et al.  Reverse vaccinology in the 21st century: improvements over the original design , 2013, Annals of the New York Academy of Sciences.