Detecting anomalies in RNA-seq quantification

Algorithms that infer isoform expression abundance from RNA-seq data have improved greatly in accuracy over the past ten years. However, because of incomplete reference transcriptomes, mapping errors, incomplete sequencing bias models, or mistakes made by the algorithm itself, the quantification model sometimes cannot explain all aspects of the input read data, and misquantification can occur. Here, we develop a computational method to detect instances where a quantification model cannot fully explain the input. Specifically, our approach identifies transcripts whose read coverage deviates significantly from the expected coverage. We call these transcripts “expression anomalies”; they represent instances where the quantification estimates may be in doubt. We further develop a method to attribute the cause of each anomaly to either incompleteness of the reference transcriptome or algorithmic mistakes, and we show that our method precisely detects misquantifications arising from both causes. Correcting the misquantifications that are labeled as algorithmic mistakes can reduce the number of false predictions of differentially expressed transcripts. Applying anomaly detection to 30 GEUVADIS and 16 Human Body Map samples, we detect 103 genes with potential unannotated isoforms. These genes tend to be longer than average and to contain a very long exon near the 3′ end that the unannotated isoform excludes. Anomaly detection is a new approach to investigating the expression quantification problem and may find wider use in other areas of genomics.
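
As an illustration of what "read coverage deviating from the expectation" could mean in practice, the following is a minimal sketch and not the method developed in this work. It assumes that per-position observed read counts and a model-derived expected coverage profile are already available for a transcript. The score (the largest difference in normalized read mass over any contiguous region) and the simulation-based p-value are assumptions made for illustration, and the names `anomaly_score` and `empirical_p_value` are hypothetical.

```python
import random
from itertools import accumulate


def anomaly_score(observed, expected):
    """Largest absolute gap in normalized read mass over any contiguous region.

    Illustrative score only (an assumption, not the published definition).
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    obs_total = sum(observed)
    exp_total = sum(expected)
    if obs_total <= 0 or exp_total <= 0:
        raise ValueError("coverage profiles must have positive total mass")
    # Per-position difference between the two normalized coverage profiles.
    diffs = [o / obs_total - e / exp_total for o, e in zip(observed, expected)]
    # The sum over region i..j equals prefix[j + 1] - prefix[i], so the
    # largest absolute region sum is max(prefix) - min(prefix).
    prefix = [0.0] + list(accumulate(diffs))
    return max(prefix) - min(prefix)


def empirical_p_value(observed, expected, n_sim=1000, seed=0):
    """Estimate how often reads drawn from `expected` score at least as high."""
    rng = random.Random(seed)
    n_reads = int(round(sum(observed)))
    positions = list(range(len(expected)))
    score = anomaly_score(observed, expected)
    hits = 0
    for _ in range(n_sim):
        # Simulate a coverage profile under the null: reads fall on
        # positions in proportion to the expected coverage.
        simulated = [0] * len(expected)
        for pos in rng.choices(positions, weights=expected, k=n_reads):
            simulated[pos] += 1
        if anomaly_score(simulated, expected) >= score:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    expected = [1.0] * 10            # flat expected coverage (toy example)
    skewed = [20] * 5 + [0] * 5      # all reads pile up in the first half
    print(anomaly_score(skewed, expected))
    print(empirical_p_value(skewed, expected, n_sim=200))
```

In this toy setting, a transcript whose reads concentrate in one half receives a high score and a small empirical p-value, whereas a profile that matches the expectation scores zero. A real analysis would also need to account for sequencing bias models and for uncertainty in the expected coverage itself.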
