PseudoLasso: leveraging read alignment in homologous regions to correct pseudogene expression estimates via RNASeq

Pseudogenes have long been considered nonfunctional segments of the genome, but recent studies have provided evidence of their regulatory roles in biological processes. With growing interest in pseudogene research, scientists rely on RNA sequencing to estimate the expression levels of pseudogenes in different tissues or cell lines. The major challenge in quantifying pseudogenes with RNASeq lies in the high sequence similarity between pseudogenes and their homologous parent genes: reads can align ambiguously to multiple homologous regions. In this article, we present PseudoLasso, a genome-wide approach that accurately estimates the abundance of pseudogenes and their parents and correctly assigns reads to their regions of origin. Our approach learns read alignment behaviors and leverages this knowledge for abundance estimation and alignment correction. Compared with the read count estimates reported by TopHat2, PseudoLasso provides estimates with a 10-fold lower error rate.
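
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, under stated assumptions rather than the authors' implementation. It assumes that "alignment behavior" can be summarized as a matrix `A` whose entry `A[i][j]` is the fraction of reads originating from region `j` that an aligner places in region `i`, and that true abundances `x` can be recovered from observed counts `b` by a nonnegative, L1-penalized least-squares fit. The function name `estimate_abundance`, the penalty `lam`, and all numbers are hypothetical.

```python
def estimate_abundance(A, b, lam=0.0, n_iter=500):
    """Minimize 0.5 * ||A x - b||^2 + lam * sum(x) subject to x >= 0
    by cyclic coordinate descent (illustrative sketch only)."""
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    residual = list(b)  # equals b - A x, starting from x = 0
    # Squared norm of each column of A, reused in every update.
    col_sq = [sum(A[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(n_iter):
        for j in range(n_cols):
            if col_sq[j] == 0.0:
                continue  # region never receives reads; leave x[j] at 0
            # Correlation of column j with the residual that excludes x[j].
            rho = sum(A[i][j] * residual[i] for i in range(n_rows))
            rho += col_sq[j] * x[j]
            # Soft-threshold by lam and clip at zero (nonnegativity).
            new_xj = max(0.0, (rho - lam) / col_sq[j])
            delta = new_xj - x[j]
            if delta != 0.0:
                for i in range(n_rows):
                    residual[i] -= A[i][j] * delta
                x[j] = new_xj
    return x


# Hypothetical two-region example: column 0 is a parent gene, column 1 its
# pseudogene. 10% of parent reads are misplaced onto the pseudogene, and
# 30% of pseudogene reads are misplaced onto the parent.
alignment_behavior = [
    [0.9, 0.3],  # reads landing on the parent
    [0.1, 0.7],  # reads landing on the pseudogene
]
observed_counts = [960.0, 240.0]  # made-up aligner counts per region

corrected = estimate_abundance(alignment_behavior, observed_counts)
print([round(value, 3) for value in corrected])
```

In this toy setting the raw counts attribute 240 reads to the pseudogene, while the fit attributes the observed counts to about 1000 parent reads and 200 pseudogene reads, which is the mixture that reproduces the observed counts under the assumed matrix. In practice such a matrix would have to be learned, for example from simulated reads, which this sketch does not cover.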
