Data-based filtering for replicated high-throughput transcriptome sequencing experiments

Motivation: RNA sequencing is now widely performed to study differential expression among experimental conditions. As tests are performed on a large number of genes, stringent false-discovery rate control is required, at the expense of detection power. Ad hoc filtering techniques are regularly used to moderate this correction by removing genes with low signal, with little attention paid to their impact on downstream analyses.

Results: We propose a data-driven method based on the Jaccard similarity index to calculate a filtering threshold for replicated RNA sequencing data. In comparisons with alternative data filters regularly used in practice, we demonstrate that the proposed method correctly filters lowly expressed genes, leading to increased detection power for moderately to highly expressed genes. Interestingly, this data-driven threshold varies among experiments, which underscores the value of estimating it from the data rather than fixing it in advance.

Availability: The proposed filtering method is implemented in the R package HTSFilter, available on Bioconductor.

Contact: andrea.rau@jouy.inra.fr

Supplementary information: Supplementary data are available at Bioinformatics online.
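The abstract does not spell out how the Jaccard index yields a threshold, so the following is only a minimal sketch under stated assumptions, not the HTSFilter implementation. It assumes that each sample is binarized at a candidate threshold s (count > s), that similarity is averaged over pairs of replicates from the same condition, that the chosen threshold is the candidate maximizing this average, and that a gene is kept when it exceeds the threshold in at least one sample. The function names and the samples-by-genes list layout are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two equal-length boolean sequences."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two all-False vectors are given similarity 0.
    return both / either if either else 0.0


def global_jaccard(counts, conditions, s):
    """Mean pairwise Jaccard index over replicates of the same condition.

    counts: list of samples, each a list of per-gene (normalized) counts.
    conditions: one condition label per sample.
    s: candidate threshold; a gene is "expressed" in a sample if count > s.
    """
    binarized = [[c > s for c in sample] for sample in counts]
    pairs = [
        (i, j)
        for i, j in combinations(range(len(counts)), 2)
        if conditions[i] == conditions[j]
    ]
    if not pairs:
        raise ValueError("need at least two replicates in some condition")
    return sum(jaccard(binarized[i], binarized[j]) for i, j in pairs) / len(pairs)


def choose_threshold(counts, conditions, candidates):
    """Candidate threshold with the highest global Jaccard index.

    On ties, the first such candidate in the given order is returned.
    """
    return max(candidates, key=lambda s: global_jaccard(counts, conditions, s))


def filter_genes(counts, s):
    """Indices of genes whose count exceeds s in at least one sample."""
    n_genes = len(counts[0])
    return [g for g in range(n_genes) if any(sample[g] > s for sample in counts)]
```

In this sketch the candidate thresholds are supplied by the caller as a plain grid. Because the similarity is computed within conditions only, the choice of threshold does not depend on which genes differ between conditions.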
