A fuzzy method for RNA-Seq differential expression analysis in presence of multireads

Background

When the reads obtained from high-throughput RNA sequencing are mapped against a reference database, a significant proportion of them, known as multireads, can map to more than one reference sequence. These multireads originate from gene duplications, repetitive regions or overlapping genes. In RNA-Seq analyses, removing the multireads from the mapping results causes an underestimation of the read counts, while estimating the real read count can lead to false positives during the detection of differentially expressed sequences.

Results

We present an innovative approach to deal with multireads and evaluate differential expression events, entirely based on fuzzy set theory. Since multireads cause uncertainty in the estimation of read counts during gene expression computation, they can also influence the reliability of differential expression analysis results by producing false positives. Our method manages the uncertainty in gene expression estimation by defining fuzzy read counts, and evaluates the possibility that a gene is differentially expressed using three fuzzy concepts: over-expression, same-expression and under-expression. The output of the method is a list of differentially expressed genes enriched with information about the uncertainty of the results due to the presence of multireads.

We have tested the method on RNA-Seq data designed for case-control studies and compared the results with those of existing tools for read count estimation and differential expression analysis.

Conclusions

Managing multireads with fuzzy sets yields a list of differential expression events that takes into account the uncertainty in the results caused by the presence of multireads. Biologists can use this additional information when selecting the most relevant differential expression events to validate with laboratory assays. Our method can be used to compute reliable differential expression events and to highlight possible false positives in the lists of differentially expressed genes computed with other tools.
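The abstract does not give the membership functions behind the fuzzy read counts or the three fuzzy concepts, so the following is only a minimal illustrative sketch, not the authors' implementation. It assumes that a fuzzy read count can be approximated by an interval running from the uniquely mapped reads to the unique reads plus all multireads, and that each concept is a simple ramp over the log2 fold change. The names `fuzzy_count`, `fold_change_range` and `possibilities`, the pseudocount and the threshold are all hypothetical.

```python
import math


def fuzzy_count(unique_reads, multireads):
    """Assumed interval count: (unique reads only, unique reads plus every multiread)."""
    return (unique_reads, unique_reads + multireads)


def fold_change_range(case, control, pseudo=1.0):
    """Smallest and largest log2 fold change compatible with the two intervals."""
    lowest = math.log2((case[0] + pseudo) / (control[1] + pseudo))
    highest = math.log2((case[1] + pseudo) / (control[0] + pseudo))
    return lowest, highest


def ramp(x, zero_at, one_at):
    """Linear membership: 0 at `zero_at`, 1 at `one_at`, clipped to [0, 1]."""
    t = (x - zero_at) / (one_at - zero_at)
    return min(1.0, max(0.0, t))


def possibilities(case, control, threshold=1.0):
    """Possibility of each concept: the best membership reached inside the range."""
    lowest, highest = fold_change_range(case, control)
    # Over-expression membership grows with the fold change, so its maximum is at `highest`.
    over = ramp(highest, 0.0, threshold)
    # Under-expression membership grows as the fold change drops, so its maximum is at `lowest`.
    under = ramp(lowest, 0.0, -threshold)
    # Same-expression peaks at 0: take the point of the range closest to 0.
    nearest = 0.0 if lowest <= 0.0 <= highest else min(abs(lowest), abs(highest))
    same = 1.0 - ramp(nearest, 0.0, threshold)
    return {"over": over, "same": same, "under": under}


# A gene whose case sample carries many multireads: both "over" and "same"
# remain fully possible, which flags the call as uncertain.
ambiguous = possibilities(fuzzy_count(100, 300), fuzzy_count(100, 0))

# A gene with no multireads: only "over" remains possible.
clear = possibilities(fuzzy_count(300, 0), fuzzy_count(100, 0))
```

Under these assumptions, a gene for which more than one concept keeps a high possibility is exactly the kind of event the abstract describes as a potential false positive caused by multireads.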
